Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference
Pith reviewed 2026-05-22 07:23 UTC · model grok-4.3
The pith
Asymmetric virtual memory paging keeps KV and SSM caches in separate physical pools behind one virtual address space and migrates capacity only on allocation failure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The allocator separates KV caches and SSM states into physically distinct pools that share one virtual address space; when either pool runs out, spare capacity is migrated from the other pool, but migration occurs only after an allocation failure so that overall behavior remains deterministic.
What carries the argument
Asymmetric Virtual Memory Paging (AVMP), an allocator that maintains physically separate KV and SSM pools under a unified virtual address space and migrates capacity only on allocation failure.
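The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's allocator: the names `Pool`, `DualPoolAllocator`, `alloc`, and `release` are hypothetical, capacity is counted in abstract units rather than GPU pages, and the donor-selection policy (move exactly the shortfall) is an assumption, since the paper does not publish pseudocode.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """One physical pool, measured in abstract capacity units (assumption)."""
    capacity: int
    used: int = 0

    @property
    def free(self) -> int:
        return self.capacity - self.used


@dataclass
class DualPoolAllocator:
    """Hypothetical sketch: separate KV and SSM pools behind one virtual id space."""
    kv: Pool
    ssm: Pool
    table: dict = field(default_factory=dict)       # virtual id -> (kind, units)
    migrations: list = field(default_factory=list)  # log of (kind, units moved)
    next_vaddr: int = 0

    def _pair(self, kind: str):
        """Return (target pool, donor pool) for a cache kind."""
        if kind == "kv":
            return self.kv, self.ssm
        if kind == "ssm":
            return self.ssm, self.kv
        raise ValueError(f"unknown cache kind: {kind!r}")

    def alloc(self, kind: str, units: int) -> int:
        target, donor = self._pair(kind)
        if target.free < units:
            # Allocation failure: the only point at which capacity migrates.
            shortfall = units - target.free
            if donor.free < shortfall:
                raise MemoryError(f"{kind} pool exhausted and donor cannot cover it")
            donor.capacity -= shortfall
            target.capacity += shortfall
            self.migrations.append((kind, shortfall))
        target.used += units
        vaddr = self.next_vaddr  # callers see one id space regardless of pool
        self.next_vaddr += 1
        self.table[vaddr] = (kind, units)
        return vaddr

    def release(self, vaddr: int) -> None:
        kind, units = self.table.pop(vaddr)
        target, _ = self._pair(kind)
        target.used -= units


alloc = DualPoolAllocator(kv=Pool(4), ssm=Pool(4))
alloc.alloc("kv", 3)      # fits, no migration
alloc.alloc("kv", 3)      # fails first, so 2 units move from the SSM pool
print(alloc.migrations)
```

Because the sketch never rebalances proactively, the migration log is a function of the request sequence alone, which is the property the determinism claim relies on.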
If this is right
- Out-of-memory events fall by 7.6 percent across evaluated workloads.
- Request throughput rises by 1.83x to 13.3x on synthetic workloads and by 2.36x on ShareGPT traces.
- Gains remain statistically significant under paired-bootstrap 95 percent confidence intervals.
- Phase-time breakdowns separate the benefit into shorter OOM recovery on pressured workloads and faster allocation calls on KV-heavy workloads.
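For readers unfamiliar with the statistic behind the confidence-interval claim, the sketch below shows one common form of a paired percentile bootstrap. It is an assumption-laden illustration: the paper does not state its pairing method or resample count, so the choice to pair by workload cell, the throughput-ratio statistic, the 2000 resamples, and the function name `paired_bootstrap_ci` are all hypothetical, and the sample numbers are made up.

```python
import random


def paired_bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the ratio of total treated to total baseline.

    Pairing is kept by resampling cell indices, so each draw uses the
    baseline and treated measurement from the same cell.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty samples of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(baseline)
    ratios = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        base_total = sum(baseline[i] for i in idx)
        treat_total = sum(treated[i] for i in idx)
        ratios.append(treat_total / base_total)
    ratios.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - 1 - lo_idx
    return ratios[lo_idx], ratios[hi_idx]


# Illustrative, made-up requests-per-second values for five cells.
baseline_rps = [10.0, 12.0, 11.0, 9.0, 10.5]
avmp_rps = [21.0, 30.0, 19.5, 22.0, 24.0]
low, high = paired_bootstrap_ci(baseline_rps, avmp_rps)
print(f"95% CI for speedup: [{low:.2f}, {high:.2f}]")
```

Under this reading, a gain "holds" when the lower bound of the interval stays above 1.0.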
Where Pith is reading between the lines
- The same unified-virtual-plus-physical-separation pattern could be applied to other model families that maintain caches with mismatched growth rates.
- Because migration is lazy and deterministic, the technique may integrate cleanly into existing inference servers without requiring changes to scheduling logic.
- The pure-Python implementation suggests the approach can be adopted quickly, while Triton or CUDA kernels could later reduce migration cost further.
Load-bearing premise
Migration triggered solely on allocation failure will keep overall behavior deterministic and will not introduce hidden latency or correctness issues under realistic request interleaving.
What would settle it
A workload of rapidly interleaved requests with differing cache-size ratios that shows either higher tail latency or non-deterministic outputs after migration would refute the premise.
Original abstract
Hybrid language models like Jamba mix attention layers with State Space Models (SSMs), creating two memory cache types with opposite profiles: Key-Value (KV) caches grow linearly with sequence length, while SSM states stay fixed per layer. Current inference engines handle this poorly. Unified pools pad SSM states to attention page sizes, wasting up to 7.3x capacity. Static dual pools cannot adapt when prompt distributions shift between requests. We present Asymmetric Virtual Memory Paging (AVMP). The allocator separates the two cache types into physically distinct pools behind a unified virtual address space, and migrates capacity between pools when one runs out. Migration triggers only on allocation failure, keeping behavior deterministic. We evaluate AVMP across 270 synthetic cells plus 60 cells of ShareGPT trace replay on an RTX 3060 12GB. Out-of-Memory events drop 7.6% and request throughput improves 1.83x to 13.3x across synthetic workloads and 2.36x on ShareGPT. All gains hold under paired-bootstrap 95% confidence intervals. A phase-time breakdown reveals two distinct mechanisms: shorter OOM recovery on capacity-pressured workloads, and faster allocation calls on KV-heavy workloads. Implementation is pure Python; Triton integration is future work.
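The padding overhead the abstract attributes to unified pools can be illustrated with simple arithmetic. The sketch below assumes that a fixed-size SSM state is rounded up to a whole number of attention-sized pages; the byte sizes are hypothetical and are not intended to reproduce the paper's 7.3x figure, whose exact derivation is not given here.

```python
def padded_bytes(state_bytes: int, page_bytes: int) -> int:
    """Round a state up to a whole number of pages (integer ceiling division)."""
    pages = -(-state_bytes // page_bytes)
    return pages * page_bytes


def padding_overhead(state_bytes: int, page_bytes: int) -> float:
    """Ratio of the padded footprint to the bytes the state actually needs."""
    return padded_bytes(state_bytes, page_bytes) / state_bytes


# Hypothetical sizes: a 3 KiB SSM state stored in a pool of 16 KiB attention pages.
state, page = 3 * 1024, 16 * 1024
print(padded_bytes(state, page), round(padding_overhead(state, page), 2))
```

The overhead is largest when the state is much smaller than one page, which is the case a dedicated SSM pool with its own page size avoids.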
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Asymmetric Virtual Memory Paging (AVMP) for hybrid Mamba-Transformer models such as Jamba. It separates KV caches (which grow with sequence length) and SSM states (fixed per layer) into distinct physical memory pools under a single virtual address space, performing capacity migration exclusively upon allocation failure to preserve determinism. On an RTX 3060 12GB GPU, AVMP reduces out-of-memory events by 7.6% and improves request throughput by 1.83x–13.3x on 270 synthetic workload cells and by 2.36x on 60 cells of ShareGPT trace replay, with all gains supported by paired-bootstrap 95% confidence intervals. A phase-time breakdown attributes gains to shorter OOM recovery and faster allocations; the implementation is described as pure Python.
Significance. If the empirical results hold under broader conditions, AVMP offers a practical approach to reducing memory fragmentation in hybrid architectures that mix linear-growing and fixed-size caches. The use of bootstrap confidence intervals on both synthetic and real traces strengthens the performance claims. However, the absence of allocator pseudocode, overhead breakdowns, or formal arguments for determinism limits immediate reproducibility and generalization beyond the reported hardware and workloads.
major comments (2)
- [Evaluation] Evaluation section: The central claim that migration triggered only on allocation failure keeps behavior deterministic is not supported by experiments that vary request arrival order, batch sizes, or interleaving patterns; the reported synthetic cells and ShareGPT replay use fixed traces, leaving open whether different migration paths or pool sizes can arise across runs that differ only in timing.
- [Abstract and Evaluation] Abstract and Evaluation: No pseudocode, allocator algorithm, or breakdown of migration overheads (e.g., copy costs, virtual address remapping) is provided, so the reported throughput gains rest on unreviewed implementation details whose contribution cannot be isolated from the unified virtual address space itself.
minor comments (2)
- [Abstract] The abstract states 'All gains hold under paired-bootstrap 95% confidence intervals' but does not specify the exact pairing method or number of resamples used.
- [Evaluation] Figure or table captions for the phase-time breakdown should explicitly label the two mechanisms (OOM recovery vs. allocation speed) to match the textual description.
Simulated Authors' Rebuttal
We thank the referee for the careful review and constructive comments on our manuscript. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the paper.
- Referee: [Evaluation] Evaluation section: The central claim that migration triggered only on allocation failure keeps behavior deterministic is not supported by experiments that vary request arrival order, batch sizes, or interleaving patterns; the reported synthetic cells and ShareGPT replay use fixed traces, leaving open whether different migration paths or pool sizes can arise across runs that differ only in timing.
  Authors: We agree that the current evaluation relies on fixed traces, which limits direct evidence for determinism across varying arrival orders. The design ensures determinism because migration is triggered exclusively on allocation failure and follows a fixed policy for selecting source pools and target capacities; the outcome depends only on the sequence of allocation sizes requested, not on wall-clock timing. Different interleavings may change when failures occur but not the final pool sizes for a given total demand. To address the concern directly, we will add a new subsection in the revised Evaluation with experiments that permute batch sizes and interleaving patterns drawn from the synthetic workload cells, measuring migration events and confirming that throughput gains remain consistent within the reported bootstrap intervals.
  Revision: yes
- Referee: [Abstract and Evaluation] Abstract and Evaluation: No pseudocode, allocator algorithm, or breakdown of migration overheads (e.g., copy costs, virtual address remapping) is provided, so the reported throughput gains rest on unreviewed implementation details whose contribution cannot be isolated from the unified virtual address space itself.
  Authors: We acknowledge that the absence of pseudocode and explicit overhead measurements reduces immediate reproducibility. In the revised manuscript we will add an appendix containing pseudocode for the core allocator routines (virtual address mapping, failure-triggered migration, and pool resizing). We will also extend the phase-time breakdown with a table isolating migration costs (host-to-device copies and virtual remapping) from baseline allocation time on the RTX 3060. These additions will allow readers to separate the benefit of the unified address space from the migration mechanism itself while preserving the pure-Python implementation details already described.
  Revision: yes
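The determinism argument in the first response, that outcomes depend only on the sequence of allocation requests, suggests a simple replay check. The sketch below is a toy model under stated assumptions, not the authors' experiment: the `replay` function, its migrate-exactly-the-shortfall policy, the choice to reject a request that neither pool can satisfy, and the trace itself are all hypothetical.

```python
def replay(trace, kv_cap, ssm_cap):
    """Replay (kind, units) requests; return (event log, final pool capacities).

    Toy policy (assumption): on allocation failure, move exactly the
    shortfall from the other pool; if the other pool cannot cover it,
    record an OOM and reject the request.
    """
    cap = {"kv": kv_cap, "ssm": ssm_cap}
    used = {"kv": 0, "ssm": 0}
    log = []
    for step, (kind, units) in enumerate(trace):
        other = "ssm" if kind == "kv" else "kv"
        shortfall = units - (cap[kind] - used[kind])
        if shortfall > 0:
            if cap[other] - used[other] < shortfall:
                log.append((step, kind, "oom"))
                continue  # request rejected, pools unchanged
            cap[other] -= shortfall
            cap[kind] += shortfall
            log.append((step, kind, shortfall))
        used[kind] += units
    return log, cap


trace = [("kv", 3), ("ssm", 1), ("kv", 3), ("ssm", 2)]
first = replay(trace, kv_cap=4, ssm_cap=4)
second = replay(trace, kv_cap=4, ssm_cap=4)
assert first == second  # same request sequence, same migrations and pool sizes
print(first)
```

Feeding permuted traces to the same function and comparing the resulting logs is one way to probe order sensitivity in the manner the referee asks for; the toy model says nothing about what the real allocator would show.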
Circularity Check
No significant circularity; empirical system design with direct measurements
The paper describes a practical memory allocator for hybrid models that separates KV and SSM caches into distinct physical pools with migration only on allocation failure. All reported improvements (OOM reduction, throughput gains) are presented as direct empirical results from hardware experiments on synthetic cells and ShareGPT traces, with no equations, fitted parameters, predictions derived from inputs, or self-citation chains that reduce the central claims to prior work by the same authors. The determinism claim is a stated design property evaluated experimentally rather than a derived quantity that loops back to itself. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Standard virtual-memory hardware and OS paging primitives behave as described for unified address spaces.
- domain assumption: Migration cost is incurred only on allocation failure and does not affect correctness or determinism under the tested workloads.