Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism
Pith reviewed 2026-05-18 03:01 UTC · model grok-4.3
The pith
Nirvana uses a task-aware memory trigger to adapt general models to specialized domains like biomedicine and law at test time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Nirvana is a Specialized Generalist Model that features specialized memory, linear-time complexity, and test-time task information extraction. Its central components are the Task-Aware Memory Trigger, which treats each input as a self-supervised fine-tuning task and adjusts task-related parameters on the fly, and the Specialized Memory Updater, which dynamically consolidates task-relevant context. This design enables Nirvana to match or surpass LLM baselines on general benchmarks, achieve the lowest perplexity across specialized domains including biomedicine, finance, and law, and deliver higher-fidelity MRI reconstructions than conventional LLM-based models when lightweight codecs are fine-tuned on paired k-space signals and images.
What carries the argument
Task-Aware Memory Trigger, which extracts task information at test time and dynamically adjusts task-related parameters to enable on-the-fly specialization without harming general capabilities.
Load-bearing premise
Treating each input as a self-supervised fine-tuning task and adjusting task-related parameters on the fly via the Trigger will produce stable, beneficial specialization without harming general capabilities or introducing instability at test time.
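To make the premise concrete, here is a minimal toy sketch of test-time self-supervised adaptation. It is not the paper's Trigger: the model, the parameter names `w_base` and `b_task`, and the squared-error objective are all assumptions chosen for illustration. The "general" weight stays frozen while a single hypothetical task parameter takes one gradient step per observed element, using the next element of the input itself as the target.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    w_base: float          # frozen "general" weight, never updated at test time
    b_task: float = 0.0    # hypothetical task-related parameter, adjusted on the fly

    def predict(self, x: float) -> float:
        return self.w_base * x + self.b_task


def adapt_on_input(model: ToyModel, sequence: list[float], lr: float = 0.1) -> list[float]:
    """Predict each next element of `sequence`, then take one gradient step
    on b_task only. Returns the squared error recorded before each update."""
    errors = []
    for x_t, x_next in zip(sequence, sequence[1:]):
        residual = model.predict(x_t) - x_next
        errors.append(residual ** 2)
        # Gradient of residual**2 with respect to b_task is 2 * residual.
        # w_base is deliberately left untouched.
        model.b_task -= lr * 2.0 * residual
    return errors


# A "specialized" input whose dynamics are x_{t+1} = x_t + 3,
# an offset the frozen base weight alone cannot express.
sequence = [float(3 * i) for i in range(20)]
model = ToyModel(w_base=1.0)
errors = adapt_on_input(model, sequence)
```

In this toy, each step multiplies the gap between `b_task` and its target by `1 - 2 * lr`, so the error shrinks for step sizes between 0 and 1 and stops converging at or above 1. That is a small-scale version of the instability the premise assumes away.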
What would settle it
If evaluations on general benchmarks show scores dropping below standard LLM baselines, or if specialized-domain perplexity rises above competing models when the Trigger is active, the central claim would be falsified.
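The two conditions above can be written as a small check. This is a sketch under stated assumptions: the function name, the dictionary layout, and the strict reading (a single benchmark or domain is enough to trip the check) are choices made here, and the numbers are hypothetical placeholders rather than results from the paper.

```python
def central_claim_falsified(general_scores, baseline_scores,
                            model_perplexity, competitor_perplexity):
    """Return True if either falsifying condition is met.

    general_scores / baseline_scores: dict of benchmark -> score (higher is better).
    model_perplexity / competitor_perplexity: dict of domain -> perplexity
    measured with the Trigger active (lower is better).
    """
    below_baseline = [name for name, score in general_scores.items()
                      if score < baseline_scores[name]]
    worse_perplexity = [domain for domain, ppl in model_perplexity.items()
                        if ppl > competitor_perplexity[domain]]
    return bool(below_baseline or worse_perplexity)


# Hypothetical placeholder values, used only to exercise the function.
holds = central_claim_falsified(
    general_scores={"benchmark_a": 0.70, "benchmark_b": 0.65},
    baseline_scores={"benchmark_a": 0.70, "benchmark_b": 0.60},
    model_perplexity={"biomedicine": 9.0, "finance": 11.0, "law": 8.0},
    competitor_perplexity={"biomedicine": 10.0, "finance": 12.0, "law": 9.0},
)
breaks = central_claim_falsified(
    general_scores={"benchmark_a": 0.55},
    baseline_scores={"benchmark_a": 0.70},
    model_perplexity={"law": 8.0},
    competitor_perplexity={"law": 9.0},
)
```

A looser reading would aggregate across benchmarks before comparing; the source text does not say which reading is intended.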
Original abstract
Large Language Models (LLMs) excel at general language tasks but struggle in specialized domains. Specialized Generalist Models (SGMs) address this by preserving broad capabilities while adapting to target domains. However, existing architectures provide limited support for task-guided specialized memory mechanisms. In this work, we introduce Nirvana, an SGM featuring specialized memory, linear-time complexity, and test-time task information extraction. Central to Nirvana are: (1) Task-Aware Memory Trigger ($\textit{Trigger}$), which treats each input as a self-supervised fine-tuning task and adjusts task-related parameters on the fly; and (2) Specialized Memory Updater ($\textit{Updater}$), which dynamically consolidates task-relevant context. Nirvana matches or surpasses LLM baselines on general benchmarks and achieves the lowest perplexity across specialized domains including biomedicine, finance, and law. On the challenging task of Magnetic Resonance Imaging (MRI), we attach lightweight codecs to the frozen Nirvana backbone and fine-tune them on paired k-space signals and images. Nirvana achieves higher-fidelity reconstructions than conventional LLM-based models, with Trigger providing effective domain-specific adaptation. Ablation studies confirm that removing Trigger leads to substantial degradation across all tasks, underscoring its essential role in task-aware specialization. Models are available at https://huggingface.co/collections/YuhuaJiang/nirvana. Code is available at https://github.com/YuhuaJiang2002/Nirvana.
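Since the specialized-domain results are stated in terms of perplexity, a short reminder of how that metric is usually computed may help. The sketch below uses the standard definition (the exponential of the mean negative log-likelihood per token) and assumes natural-log token probabilities. The function name is illustrative, and nothing here reflects the paper's evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability the model assigned
    to each observed token. Lower perplexity means a better fit.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that spreads probability uniformly over 4 candidate tokens
# at every position has perplexity 4.
uniform_four = perplexity([math.log(0.25)] * 5)
```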
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Nirvana, a Specialized Generalist Model featuring a Task-Aware Memory Trigger (Trigger) that treats each input as a self-supervised fine-tuning task to adjust task-related parameters on the fly, and a Specialized Memory Updater (Updater) to dynamically consolidate task-relevant context. It claims to match or surpass LLM baselines on general benchmarks, achieve the lowest perplexity across specialized domains (biomedicine, finance, law), and deliver higher-fidelity MRI reconstructions by attaching and fine-tuning lightweight codecs to a frozen Nirvana backbone, with the Trigger credited for effective domain-specific adaptation. Ablation studies report substantial degradation when the Trigger is removed.
Significance. If the central claims hold after clarification, the work could contribute a practical mechanism for efficient, test-time task-aware specialization in large models while retaining general capabilities and linear-time complexity. The public release of models on Hugging Face and code on GitHub is a positive step for reproducibility.
major comments (1)
- [Abstract (MRI paragraph)] The manuscript states that 'we attach lightweight codecs to the frozen Nirvana backbone and fine-tune them on paired k-space signals and images' yet attributes the higher-fidelity reconstructions to 'Trigger providing effective domain-specific adaptation.' This creates an internal inconsistency with the core description of Trigger, which 'adjusts task-related parameters on the fly.' Because the backbone is explicitly frozen, it is unclear whether Trigger's adjustment mechanism is active in the MRI experiment or whether the reported benefit derives only from the codecs. This point is load-bearing for the central claim that Trigger enables beneficial specialization without harming general performance, and it directly affects interpretation of the ablation results.
minor comments (2)
- [Abstract] The abstract reports benchmark wins, lowest perplexity, and ablation degradation but provides no architecture diagrams, training details, hyperparameter settings, or statistical significance tests for the improvements.
- Notation for Trigger and Updater is introduced with italics but could be more consistently defined when first used in the main text.
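As one example of the kind of significance test the first minor comment asks for, the sketch below runs an exact one-sided sign-flip test on paired per-benchmark scores. The choice of test, the function name, and the scores are assumptions made for illustration; the scores are hypothetical placeholders, not numbers from the paper.

```python
from itertools import product


def paired_sign_flip_p_value(model_scores, baseline_scores):
    """One-sided exact sign-flip test on paired per-benchmark differences.

    Under the null hypothesis that the two systems are interchangeable, each
    paired difference is equally likely to carry either sign. The p-value is
    the fraction of sign assignments whose summed difference is at least as
    large as the observed one. All 2**n assignments are enumerated, so this
    is only practical for a small number of benchmarks.
    """
    if len(model_scores) != len(baseline_scores) or not model_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    observed = sum(diffs)
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = sum(s * d for s, d in zip(signs, diffs))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_large += 1
    return at_least_as_large / total


# Hypothetical placeholder scores for eight benchmarks.
example_model = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.90]
example_baseline = [0.60, 0.66, 0.52, 0.76, 0.61, 0.67, 0.51, 0.81]
p_value = paired_sign_flip_p_value(example_model, example_baseline)
```

With eight benchmarks the smallest achievable p-value is 1/256, which shows why a handful of benchmark wins alone gives limited statistical resolution.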
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The major comment highlights an important point of potential ambiguity in the abstract's description of the MRI experiments. We address it directly below and have prepared revisions to improve clarity while preserving the integrity of our claims.
- Referee: [Abstract (MRI paragraph)] The manuscript states that 'we attach lightweight codecs to the frozen Nirvana backbone and fine-tune them on paired k-space signals and images' yet attributes the higher-fidelity reconstructions to 'Trigger providing effective domain-specific adaptation.' This creates an internal inconsistency with the core description of Trigger, which 'adjusts task-related parameters on the fly.' Because the backbone is explicitly frozen, it is unclear whether Trigger's adjustment mechanism is active in the MRI experiment or whether the reported benefit derives only from the codecs. This point is load-bearing for the central claim that Trigger enables beneficial specialization without harming general performance, and it directly affects interpretation of the ablation results.
Authors: We agree that the abstract wording could lead to misinterpretation and thank the referee for identifying this. In the Nirvana design, the Task-Aware Memory Trigger is implemented as a separate, lightweight module that operates on task embeddings extracted from the input; it does not modify the frozen backbone weights. During the MRI experiments, each reconstruction input is still processed by the Trigger, which treats the task as self-supervised and dynamically updates task-specific memory parameters on the fly. The codecs are additionally fine-tuned for the k-space-to-image mapping, but ablation results (reported in the main text) demonstrate that removing the Trigger causes substantial performance drops even when codecs remain. Thus the reported fidelity gains are not attributable solely to the codecs. We will revise the abstract to state explicitly that the Trigger remains active and provides the domain-specific adaptation while the backbone stays frozen, and we will add a brief clarifying sentence in the MRI section of the methods.
Revision: yes
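The disagreement above comes down to which configurations were actually compared. A minimal sketch of the corresponding two-by-two ablation grid follows, with the backbone frozen in every cell. The configuration keys, the function names, and the metric values are hypothetical placeholders introduced here, not the paper's setup or results.

```python
from itertools import product


def ablation_grid():
    """Enumerate the four (trigger_active, codecs_finetuned) configurations.
    The backbone is frozen in all of them."""
    return [
        {"backbone_frozen": True, "trigger_active": t, "codecs_finetuned": c}
        for t, c in product([True, False], repeat=2)
    ]


def trigger_effect(metric_by_config, codecs_finetuned=True):
    """Change in a higher-is-better metric from enabling the Trigger,
    with the codec setting held fixed.

    `metric_by_config` maps (trigger_active, codecs_finetuned) -> metric.
    """
    with_trigger = metric_by_config[(True, codecs_finetuned)]
    without_trigger = metric_by_config[(False, codecs_finetuned)]
    return with_trigger - without_trigger


# Hypothetical placeholder metrics, used only to show the comparison.
placeholder_metrics = {
    (True, True): 0.75,
    (False, True): 0.5,
    (True, False): 0.25,
    (False, False): 0.25,
}
effect_with_codecs = trigger_effect(placeholder_metrics, codecs_finetuned=True)
effect_without_codecs = trigger_effect(placeholder_metrics, codecs_finetuned=False)
```

Reporting the Trigger effect at both codec settings, rather than only with fine-tuned codecs, would show whether the Trigger helps on its own or only in combination with the codecs.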
Circularity Check
No circularity: performance claims rest on external benchmarks and ablations, not self-referential definitions or fitted inputs
The paper introduces Nirvana with Task-Aware Memory Trigger and Specialized Memory Updater as architectural components, then reports empirical results: matching LLM baselines on general tasks, lowest perplexity on biomedicine/finance/law, and higher-fidelity MRI reconstructions using a frozen backbone plus fine-tuned codecs. Ablation studies are cited to show degradation when Trigger is removed. No equations, derivations, or first-principles results appear that reduce to their own inputs by construction. Claims are tied to external benchmark comparisons and internal controls rather than any self-definitional loop, fitted-parameter renaming, or load-bearing self-citation chain. The architecture description and experimental outcomes remain self-contained without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Each input can be treated as a self-supervised fine-tuning task that allows on-the-fly adjustment of task-related parameters without destabilizing the model.
invented entities (2)
- Task-Aware Memory Trigger (Trigger): no independent evidence
- Specialized Memory Updater (Updater): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Task-Aware Memory Trigger treats each input as a self-supervised fine-tuning task and adjusts task-related parameters on the fly via CL-OGD
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
  AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.