Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism

arxiv: 2510.26083 · v2 · submitted 2025-10-30 · 💻 cs.LG · cs.AI

Yuhua Jiang, Shuang Cheng, Yihao Liu, Ermo Hua, Che Jiang, Weigao Sun, Yu Cheng, Feifei Gao, Biqing Qi, Bowen Zhou

Pith reviewed 2026-05-18 03:01 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords specialized generalist models · task-aware memory trigger · domain adaptation · large language models · MRI reconstruction · test-time adaptation · memory mechanism

The pith

Nirvana uses a task-aware memory trigger to adapt general models to specialized domains like biomedicine and law at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Nirvana as a specialized generalist model that keeps strong general language abilities while adapting to narrow domains. It does so by treating every input as its own self-supervised fine-tuning task and using the Task-Aware Memory Trigger to adjust relevant parameters dynamically during inference. A Specialized Memory Updater then consolidates the most useful context for that task. The authors report that this yields performance equal to or better than standard large language models on broad benchmarks, the lowest perplexity on biomedicine, finance, and law, and higher-quality MRI reconstructions when lightweight codecs are attached to the frozen backbone. Readers would care because the method offers a single model that handles both everyday and expert tasks without full retraining or separate specialized networks.

Core claim

Nirvana is a Specialized Generalist Model that features specialized memory, linear-time complexity, and test-time task information extraction. Its central components are the Task-Aware Memory Trigger, which treats each input as a self-supervised fine-tuning task and adjusts task-related parameters on the fly, and the Specialized Memory Updater, which dynamically consolidates task-relevant context. This design enables Nirvana to match or surpass LLM baselines on general benchmarks, achieve the lowest perplexity across specialized domains including biomedicine, finance, and law, and deliver higher-fidelity MRI reconstructions than conventional LLM-based models when lightweight codecs are fine-tuned on the frozen backbone.

What carries the argument

Task-Aware Memory Trigger, which extracts task information at test time and dynamically adjusts task-related parameters to enable on-the-fly specialization without harming general capabilities.

Load-bearing premise

Treating each input as a self-supervised fine-tuning task and adjusting task-related parameters on the fly via the Trigger will produce stable, beneficial specialization without harming general capabilities or introducing instability at test time.
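The premise can be made concrete with a toy sketch. This page does not specify how the Trigger computes its updates, so the scalar model, the function name `adapt_on_input`, and the learning rates below are illustrative assumptions rather than the authors' method. The sketch keeps a base weight frozen, takes one online gradient step per token on a self-supervised next-value loss, and shows that the same procedure can either specialize or diverge depending on the step size.

```python
def adapt_on_input(seq, w_base, lr):
    """Online gradient descent on a per-input fast weight (hypothetical toy).

    The toy model predicts seq[t + 1] as (w_base + delta) * seq[t].
    w_base is never modified; only delta, the per-input state, is updated.
    """
    delta = 0.0
    losses = []
    for x_t, x_next in zip(seq, seq[1:]):
        err = (w_base + delta) * x_t - x_next
        losses.append(err * err)
        # Gradient of err**2 with respect to delta is 2 * err * x_t.
        delta -= lr * 2.0 * err * x_t
    return delta, losses


# A "specialized" input whose true next-value coefficient is -1.0,
# while the frozen general-purpose weight is 0.5.
seq = [(-1.0) ** t for t in range(21)]

delta, losses = adapt_on_input(seq, w_base=0.5, lr=0.1)
_, unstable = adapt_on_input(seq, w_base=0.5, lr=1.2)

print("adapted weight:", 0.5 + delta)
print("loss first/last (lr=0.1):", losses[0], losses[-1])
print("loss first/last (lr=1.2):", unstable[0], unstable[-1])
```

With the small step size the per-input state closes most of the gap to the input's own statistics while the base weight is untouched. With the large step size the error grows at every step, which is the kind of test-time instability the premise assumes away.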

What would settle it

If evaluations on general benchmarks show scores dropping below standard LLM baselines, or if specialized-domain perplexity rises above competing models when the Trigger is active, the central claim would be falsified.
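The two falsifying conditions can be written as a small check. The sketch below is only an illustration: the function names, the dictionary layout, and every number are hypothetical placeholders, not results from the paper. It assumes higher is better for general benchmark scores and uses the standard definition of perplexity as the exponential of the negative mean token log-probability.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def central_claim_holds(general, general_baseline, domain_ppl, rival_ppl):
    """Return False if either falsifying condition is met.

    general / general_baseline: benchmark name -> score (higher is better).
    domain_ppl / rival_ppl: domain name -> perplexity (lower is better),
    where rival_ppl holds the best competing model's perplexity.
    """
    general_ok = all(general[b] >= general_baseline[b] for b in general_baseline)
    domain_ok = all(domain_ppl[d] <= rival_ppl[d] for d in rival_ppl)
    return general_ok and domain_ok


# Placeholder values for illustration only.
ppl_uniform_two = perplexity([math.log(0.5)] * 4)  # a model guessing between 2 tokens

holds = central_claim_holds(
    general={"bench_a": 0.61, "bench_b": 0.48},
    general_baseline={"bench_a": 0.60, "bench_b": 0.48},
    domain_ppl={"law": 9.1, "finance": 11.0},
    rival_ppl={"law": 9.8, "finance": 11.4},
)
falsified = not central_claim_holds(
    general={"bench_a": 0.55, "bench_b": 0.48},
    general_baseline={"bench_a": 0.60, "bench_b": 0.48},
    domain_ppl={"law": 9.1, "finance": 11.0},
    rival_ppl={"law": 9.8, "finance": 11.4},
)
print(ppl_uniform_two, holds, falsified)
```

A real check would also need confidence intervals, since a drop within run-to-run noise should not count as falsification.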

Figures

Figures reproduced from arXiv: 2510.26083 by Biqing Qi, Bowen Zhou, Che Jiang, Ermo Hua, Feifei Gao, Shuang Cheng, Weigao Sun, Yihao Liu, Yu Cheng, Yuhua Jiang.

**Figure 2.** MRI reconstruction performance comparison for models with 160M trainable parameters.

**Figure 3.** MRI reconstruction performance comparison for models with 160M trainable parameters.

**Figure 4.** An example of the overall MRI report generation process.

**Figure 5.** A toy example for combinatorial tasks of common sense reasoning and key information

**Figure 6.** Length extrapolation from 4K to 20K tokens on 3 long benchmarks.

**Figure 7.** MRI reconstruction performances comparison for models with 160M trainable parameters.

Original abstract

Large Language Models (LLMs) excel at general language tasks but struggle in specialized domains. Specialized Generalist Models (SGMs) address this by preserving broad capabilities while adapting to target domains. However, existing architectures provide limited support for task-guided specialized memory mechanisms. In this work, we introduce Nirvana, an SGM featuring specialized memory, linear-time complexity, and test-time task information extraction. Central to Nirvana are: (1) Task-Aware Memory Trigger ($\textit{Trigger}$), which treats each input as a self-supervised fine-tuning task and adjusts task-related parameters on the fly; and (2) Specialized Memory Updater ($\textit{Updater}$), which dynamically consolidates task-relevant context. Nirvana matches or surpasses LLM baselines on general benchmarks and achieves the lowest perplexity across specialized domains including biomedicine, finance, and law. On the challenging task of Magnetic Resonance Imaging (MRI), we attach lightweight codecs to the frozen Nirvana backbone and fine-tune them on paired k-space signals and images. Nirvana achieves higher-fidelity reconstructions than conventional LLM-based models, with Trigger providing effective domain-specific adaptation. Ablation studies confirm that removing Trigger leads to substantial degradation across all tasks, underscoring its essential role in task-aware specialization. Models are available at https://huggingface.co/collections/YuhuaJiang/nirvana. Code is available at https://github.com/YuhuaJiang2002/Nirvana.
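For readers unfamiliar with the MRI task, the relationship between k-space signals and images can be sketched in a few lines. An MRI scanner samples the spatial-frequency domain (k-space), and the image is the inverse Fourier transform of those samples. The sketch below uses a one-dimensional toy signal and a hand-written DFT. The undersampling pattern is an assumption borrowed from common accelerated-MRI practice, since the abstract does not state the sampling scheme, and none of this represents the paper's learned codecs.

```python
import cmath


def dft(x):
    """Forward discrete Fourier transform (image row -> k-space)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse discrete Fourier transform (k-space -> image row)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def mse(a, b):
    """Mean squared error between two equal-length real sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


image_row = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0]  # toy 1-D "image"
kspace = dft(image_row)

# Fully sampled k-space: the inverse transform recovers the image.
full = [v.real for v in idft(kspace)]

# Assumed undersampling: keep every other k-space sample, zero-fill the rest.
undersampled = [v if k % 2 == 0 else 0 for k, v in enumerate(kspace)]
zero_filled = [v.real for v in idft(undersampled)]

print("error, fully sampled:", mse(full, image_row))
print("error, zero-filled  :", mse(zero_filled, image_row))
```

The zero-filled reconstruction folds the two halves of the signal onto each other, so it carries a visible aliasing error. Closing that gap is the job a reconstruction model, learned or otherwise, has to do.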

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Nirvana adds a task-aware trigger and updater for test-time specialization in LLMs with some benchmark wins and open code, but the MRI setup looks inconsistent with the claimed on-the-fly parameter adjustment.

The main thing to know is that this paper introduces Nirvana, which uses a Task-Aware Memory Trigger to treat each input as a self-supervised fine-tuning task and adjust parameters dynamically, paired with a Specialized Memory Updater for consolidating context, all while keeping linear complexity. It reports matching or beating LLM baselines on general tasks and lowest perplexity on specialized domains like biomedicine, finance, and law, plus better MRI reconstructions when attaching codecs to a frozen backbone. Ablations show clear drops without the Trigger. The authors also release models on Hugging Face and code on GitHub, which is useful for checking the work directly. That combination of a new trigger-updater pair and public resources is the concrete addition here.

The MRI results are presented as evidence of effective domain adaptation via the Trigger, yet the setup explicitly freezes the backbone and only tunes lightweight codecs. This creates a tension with the core description of on-the-fly parameter adjustment in the Trigger, since frozen weights would seem to limit what can be adjusted at test time. If the adaptation benefit is really coming from the codecs instead, the ablation and central claim need tighter wording to match the experiment. The rest of the empirical story looks standard but solid enough for the claims made.

This is the kind of paper that would interest people building or adapting LLMs for high-stakes domains who want lighter test-time methods rather than full retraining. A reader focused on memory mechanisms or efficient specialization could pull some practical ideas from the architecture and ablations.

It has enough benchmarks, an ablation, and open artifacts to deserve a serious referee, though the frozen-backbone case will need explicit clarification to hold up under review. I would send it to peer review with a note to resolve how the Trigger operates when the backbone is frozen.

Referee Report

1 major / 2 minor

Summary. The paper introduces Nirvana, a Specialized Generalist Model featuring a Task-Aware Memory Trigger (Trigger) that treats each input as a self-supervised fine-tuning task to adjust task-related parameters on the fly, and a Specialized Memory Updater (Updater) to dynamically consolidate task-relevant context. It claims to match or surpass LLM baselines on general benchmarks, achieve the lowest perplexity across specialized domains (biomedicine, finance, law), and deliver higher-fidelity MRI reconstructions by attaching and fine-tuning lightweight codecs to a frozen Nirvana backbone, with the Trigger credited for effective domain-specific adaptation. Ablation studies report substantial degradation when the Trigger is removed.

Significance. If the central claims hold after clarification, the work could contribute a practical mechanism for efficient, test-time task-aware specialization in large models while retaining general capabilities and linear-time complexity. The public release of models on Hugging Face and code on GitHub is a positive step for reproducibility.

major comments (1)

[Abstract (MRI paragraph)] The manuscript states that 'we attach lightweight codecs to the frozen Nirvana backbone and fine-tune them on paired k-space signals and images' yet attributes the higher-fidelity reconstructions to 'Trigger providing effective domain-specific adaptation.' This creates an internal inconsistency with the core description of Trigger, which 'adjusts task-related parameters on the fly.' Because the backbone is explicitly frozen, it is unclear whether Trigger's adjustment mechanism is active in the MRI experiment or whether the reported benefit derives only from the codecs. This point is load-bearing for the central claim that Trigger enables beneficial specialization without harming general performance, and it directly affects interpretation of the ablation results.

minor comments (2)

[Abstract] The abstract reports benchmark wins, lowest perplexity, and ablation degradation but provides no architecture diagrams, training details, hyperparameter settings, or statistical significance tests for the improvements.
Notation for Trigger and Updater is introduced with italics but could be more consistently defined when first used in the main text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The major comment highlights an important point of potential ambiguity in the abstract's description of the MRI experiments. We address it directly below and have prepared revisions to improve clarity while preserving the integrity of our claims.

Referee: [Abstract (MRI paragraph)] The manuscript states that 'we attach lightweight codecs to the frozen Nirvana backbone and fine-tune them on paired k-space signals and images' yet attributes the higher-fidelity reconstructions to 'Trigger providing effective domain-specific adaptation.' This creates an internal inconsistency with the core description of Trigger, which 'adjusts task-related parameters on the fly.' Because the backbone is explicitly frozen, it is unclear whether Trigger's adjustment mechanism is active in the MRI experiment or whether the reported benefit derives only from the codecs. This point is load-bearing for the central claim that Trigger enables beneficial specialization without harming general performance, and it directly affects interpretation of the ablation results.

Authors: We agree that the abstract wording could lead to misinterpretation and thank the referee for identifying this. In the Nirvana design, the Task-Aware Memory Trigger is implemented as a separate, lightweight module that operates on task embeddings extracted from the input; it does not modify the frozen backbone weights. During the MRI experiments, each reconstruction input is still processed by the Trigger, which treats the task as self-supervised and dynamically updates task-specific memory parameters on the fly. The codecs are additionally fine-tuned for the k-space-to-image mapping, but ablation results (reported in the main text) demonstrate that removing the Trigger causes substantial performance drops even when codecs remain. Thus the reported fidelity gains are not attributable solely to the codecs. We will revise the abstract to state explicitly that the Trigger remains active and provides the domain-specific adaptation while the backbone stays frozen, and we will add a brief clarifying sentence in the MRI section of the methods. revision: yes
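The distinction drawn in this simulated response, a frozen backbone alongside other parameters that still change, can be sketched with plain parameter groups. The group names (`backbone`, `codec`, `trigger_state`), the `ParamGroup` class, and the gradient values are hypothetical and chosen for illustration. The sketch shows only that freezing one group is compatible with updating others; it does not confirm how Nirvana is implemented.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    values: list
    trainable: bool


def sgd_step(groups, grads, lr=0.1):
    """Apply one gradient step, skipping any group marked as frozen."""
    for name, group in groups.items():
        if group.trainable:
            group.values = [v - lr * g for v, g in zip(group.values, grads[name])]


# Hypothetical parameter groups: only the backbone is frozen.
groups = {
    "backbone": ParamGroup([1.0, 2.0], trainable=False),
    "codec": ParamGroup([0.5], trainable=True),
    "trigger_state": ParamGroup([0.0], trainable=True),
}
grads = {"backbone": [1.0, 1.0], "codec": [1.0], "trigger_state": [-2.0]}

sgd_step(groups, grads)
for name, group in groups.items():
    print(name, group.values)
```

Under this reading, the referee's question reduces to which group the reported gain tracks. An ablation that holds the codec fixed and toggles only the trigger state would separate the two contributions.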

Circularity Check

0 steps flagged

No circularity: performance claims rest on external benchmarks and ablations, not self-referential definitions or fitted inputs

The paper introduces Nirvana with Task-Aware Memory Trigger and Specialized Memory Updater as architectural components, then reports empirical results: matching LLM baselines on general tasks, lowest perplexity on biomedicine/finance/law, and higher-fidelity MRI reconstructions using a frozen backbone plus fine-tuned codecs. Ablation studies are cited to show degradation when Trigger is removed. No equations, derivations, or first-principles results appear that reduce to their own inputs by construction. Claims are tied to external benchmark comparisons and internal controls rather than any self-definitional loop, fitted-parameter renaming, or load-bearing self-citation chain. The architecture description and experimental outcomes remain self-contained without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on two newly introduced architectural components whose internal mechanics and training dynamics are not detailed in the abstract; effectiveness is asserted via empirical results rather than derived from first principles.

axioms (1)

domain assumption: Each input can be treated as a self-supervised fine-tuning task that allows on-the-fly adjustment of task-related parameters without destabilizing the model.
This premise underpins the Task-Aware Memory Trigger as stated in the abstract.

invented entities (2)

Task-Aware Memory Trigger (Trigger) · no independent evidence
purpose: Treats input as self-supervised fine-tuning task and adjusts task-related parameters on the fly
New component introduced to enable test-time task information extraction and domain adaptation.

Specialized Memory Updater (Updater) · no independent evidence
purpose: Dynamically consolidates task-relevant context
New component introduced to manage specialized memory.

pith-pipeline@v0.9.0 · 5812 in / 1399 out tokens · 32317 ms · 2026-05-18T03:01:42.327888+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Task-Aware Memory Trigger treats each input as a self-supervised fine-tuning task and adjusts task-related parameters on the fly via CL-OGD

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
cs.RO 2025-11 unverdicted novelty 6.0

AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 1 Pith paper · 21 internal anchors

[1]

Just read twice: closing the recall gap for recurrent language models, 2024

Simran Arora, Aman Timalsina, Aaryan Singhal, Benjamin Spector, Sabri Eyuboglu, Xinyi Zhao, Ashish Rao, Atri Rudra, and Christopher Ré. Just read twice: closing the recall gap for recurrent language models, 2024

work page 2024
[2]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, et al. Improving language models by retrieving from trillions of tokens. InProceedings of the 39th International Conference on Machine Learning (ICML). PMLR, 2022

work page 2022
[5]

Ai-based reconstruction for fast MRI—a systematic review and meta-analysis.Proceedings of the IEEE, 110(2):224–245, 2022

Yutong Chen, Carola-Bibiane Schönlieb, Pietro Liò, Tim Leiner, Pier Luigi Dragotti, Ge Wang, Daniel Rueckert, David Firmin, and Guang Yang. Ai-based reconstruction for fast MRI—a systematic review and meta-analysis.Proceedings of the IEEE, 110(2):224–245, 2022

work page 2022
[6]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.arXiv preprint arXiv:2101.03961, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Online meta-learning

Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. InInternational conference on machine learning, pages 1920–1930. PMLR, 2019

work page 1920
[9]

Sliding window attention training for efficient large language models.arXiv preprint arXiv:2502.18845, 2025

Zichuan Fu, Wentao Song, Yejing Wang, Xian Wu, Yefeng Zheng, Yingying Zhang, Derong Xu, Xuetao Wei, Tong Xu, and Xiangyu Zhao. Sliding window attention training for efficient large language models.arXiv preprint arXiv:2502.18845, 2025

work page arXiv 2025
[10]

Accelerated MRI reconstructions via variational network and feature domain learning

Ilias I Giannakopoulos, Matthew J Muckley, Jesi Kim, Matthew Breen, Patricia M Johnson, Yvonne W Lui, and Riccardo Lattanzi. Accelerated MRI reconstructions via variational network and feature domain learning. Scientific Reports, 14(1):10991, 2024

work page 2024
[11]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N

Neel Guha, , et al. Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models.arXiv preprint arXiv:2308.11462, 2023

work page arXiv 2023
[13]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

A unified model for compressed sensing MRI across undersampling patterns

Armeet Singh Jatyani, Jiayun Wang, Aditi Chandrashekar, Zihui Wu, Miguel Liu-Schiaffini, Bahareh Tolooshams, and Anima Anandkumar. A unified model for compressed sensing MRI across undersampling patterns. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26004–26013, 2025

work page 2025
[15]

Mixtral of Experts

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, , et al. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

GapFormer: Fast Autoregressive Transformers meet RNNs for Personalized Adaptive Cruise Control

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pages 5156–5165. PMLR, 2020

work page 2020
[18]

Gen- eralization through memorization: Nearest neighbor language models

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, and Luke Zettlemoyer. Generalization through memoriza- tion: Nearest neighbor language models.arXiv preprint arXiv:1911.00172, 2019

work page arXiv 1911
[19]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, et al. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2020. 11 Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism

work page internal anchor Pith review Pith/arXiv arXiv 2006
[20]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. Retrieval-augmented generation for knowledge- intensive nlp.arXiv preprint arXiv:2005.11401, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[21]

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, , et al. Holistic evaluation of language models.arXiv preprint arXiv:2211.09110, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Healthgpt: A medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation

Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, et al. Healthgpt: A medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation.arXiv preprint arXiv:2502.09838, 2025

work page arXiv 2025
[24]

Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam.arXiv preprint arXiv:1711.05101, 5(5):5, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024

Guilherme Penedo, Hynek Kydlíˇ cek, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024

work page 2024
[26]

RWKV: Reinventing RNNs for the Transformer Era

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, and Matteo Grella. RWKV: Reinventing RNNs for the transformer era.arXiv preprint arXiv:2305.13048, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Eagle and finch: RWKV with matrix-valued states and dynamic recurrence, apr 2024

Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, Przemysław Kazienko, Kranthi Kiran GV , Jan Koco ´ n, Bartłomiej Koptyra, Satyapriya Krishna, Ronald McClelland Jr., Jiaju Lin, Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Cahya Wirawa...

work page 2024
[28]

Rwkv-7" goose" with expressive dynamic state evolution.arXiv preprint arXiv:2503.14456,

Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, et al. RWKV-7" goose" with expressive dynamic state evolution.arXiv preprint arXiv:2503.14456, 2025

work page arXiv 2025
[29]

HGRN2: Gated linear RNNs with state expansion.arXiv preprint arXiv:2404.07904, 2024

Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. HGRN2: Gated linear RNNs with state expansion.arXiv preprint arXiv:2404.07904, 2024

work page arXiv 2024
[30]

ArXiv preprint abs/2406.07522 (2024)

Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling.arXiv preprint arXiv:2406.07522, 2024

work page arXiv 2024
[31]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, et al. Toolformer: Language models can teach themselves to use tools.arXiv preprint arXiv:2302.04761, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Emerging trends in fast MRI using deep-learning reconstruction on undersampled k-space data: a systematic review.Bioengineering, 10(9):1012, 2023

Dilbag Singh, Anmol Monga, Hector L de Moura, Xiaoxia Zhang, Marcelo VW Zibetti, and Ravinder R Regatte. Emerging trends in fast MRI using deep-learning reconstruction on undersampled k-space data: a systematic review.Bioengineering, 10(9):1012, 2023

work page 2023
[34]

End-to-end variational networks for accelerated MRI reconstruction

Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C Lawrence Zitnick, Nafissa Yakubova, Florian Knoll, and Patricia Johnson. End-to-end variational networks for accelerated MRI reconstruction. In International conference on medical image computing and computer-assisted intervention, pages 64–73. Springer, 2020

work page 2020
[35]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, , et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.arXiv preprint arXiv:2206.04615, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024. 12 Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism

work page 2024
[37]

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Hashimoto†1 Tatsunori, and Guestrin Carlos. Learning to (learn at test time): RNNs with expressive hidden states.arXiv preprint arXiv:2407.04620, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models.arXiv preprint arXiv:2307.08621, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Jamba-1.5: Hybrid transformer-mamba models at scale.arXiv preprint arXiv:2408.12570,

Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba-1.5: Hybrid transformer-mamba models at scale.arXiv preprint arXiv:2408.12570, 2024

work page arXiv 2024
[40]

Meta-learning

Joaquin Vanschoren. Meta-learning. InAutomated machine learning: methods, systems, challenges, pages 35–61. Springer International Publishing Cham, 2019

work page 2019
[41]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in neural information processing systems, pages 5998–6008, 2017

work page 2017
[42]

Advances and challenges in meta-learning: A technical review.IEEE transactions on pattern analysis and machine intelligence, 46(7):4763–4779, 2024

Anna Vettoruzzo, Mohamed-Rafik Bouguelia, Joaquin Vanschoren, Thorsteinn Rögnvaldsson, and KC San- tosh. Advances and challenges in meta-learning: A technical review.IEEE transactions on pattern analysis and machine intelligence, 46(7):4763–4779, 2024

work page 2024
[43]

Z. Wang, J. Sun, et al. A perspective for adapting generalist ai to specialized medical ai applications and their challenges.npj Digital Medicine, 2025

work page 2025
[44]

Rabe, DeLesley Hutchins, and Christian Szegedy

Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In International Conference on Learning Representations (ICLR), 2022

work page 2022
[45]

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning.arXiv preprint arXiv:2506.07044, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Parallelizing linear transformers with the delta rule over sequence length

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. arXiv preprint arXiv:2406.06484, 2024

[47]

Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2025

[48]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, et al. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

[49]

Tokens-to-token ViT: Training vision transformers from scratch on ImageNet

Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF international conference on computer vision, pages 558–567, 2021

[50]

FastMRI: An open dataset and benchmarks for accelerated MRI

Jure Zbontar, Florian Knoll, Anuroop Sriram, Tullie Murrell, Zhengnan Huang, Matthew J Muckley, Aaron Defazio, Ruben Stern, Patricia Johnson, Mary Bruno, et al. FastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839, 2018

[51]

Towards building specialized generalist AI with System 1 and System 2 fusion

Kaiyan Zhang, Biqing Qi, and Bowen Zhou. Towards building specialized generalist AI with System 1 and System 2 fusion. arXiv preprint arXiv:2407.08642, 2024.

A Appendix

A.1 Related Work

A.1.1 Hybrid Attention-Recurrent Architectures

Samba. Samba [30] interleaves a simple st...

