
arxiv: 2605.22416 · v1 · pith:QCFYB4O7new · submitted 2026-05-21 · 💻 cs.LG · cs.DC · cs.PF

Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

Pith reviewed 2026-05-22 07:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.DC · cs.PF
keywords hybrid language models, virtual memory paging, KV cache, state space models, inference memory management, Mamba-Transformer, out-of-memory reduction, asymmetric pooling

The pith

Asymmetric virtual memory paging keeps KV and SSM caches in separate physical pools behind one virtual address space and migrates capacity only on allocation failure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Hybrid models mix attention layers that need growing KV caches with state space model layers that need fixed-size states. Unified memory pools waste space by padding the smaller states up to attention page sizes while static dual pools cannot shift capacity when request patterns change. AVMP presents both cache types through a single virtual address space but keeps them in physically distinct pools and moves capacity between pools only when an allocation would otherwise fail. This design cuts out-of-memory events and raises request throughput on both controlled synthetic loads and real ShareGPT traces. The gains appear through two mechanisms: quicker recovery after pressure and faster allocations when KV caches dominate.

Core claim

The allocator separates KV caches and SSM states into physically distinct pools that share one virtual address space; when either pool runs out, spare capacity is migrated from the other pool, but migration occurs only after an allocation failure so that overall behavior remains deterministic.
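
The failure-triggered migration described above can be sketched as a toy page-count model. This is a minimal illustration, not the paper's allocator: the class names, the `migration_batch` default, and the simplification that both pools count pages in the same unit are all assumptions made here.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """A physical pool tracked as page counts (no real memory is touched)."""
    capacity: int   # pages currently owned by this pool
    used: int = 0   # pages currently allocated

    @property
    def free(self) -> int:
        return self.capacity - self.used


@dataclass
class DualPoolAllocator:
    """Toy model: two pools; capacity moves only after an allocation would fail."""
    kv: Pool
    ssm: Pool
    migration_batch: int = 4                      # hypothetical pages per batch
    migrations: list = field(default_factory=list)

    def alloc(self, kind: str, pages: int) -> bool:
        target, donor = (self.kv, self.ssm) if kind == "kv" else (self.ssm, self.kv)
        if target.free < pages:
            # The request cannot be served: only now look at the other pool.
            shortfall = pages - target.free
            batches = -(-shortfall // self.migration_batch)   # ceiling division
            moved = min(batches * self.migration_batch, donor.free)
            if moved < shortfall:
                return False          # out of memory; pool sizes are unchanged
            donor.capacity -= moved
            target.capacity += moved
            self.migrations.append((kind, moved))
        target.used += pages
        return True

    def release(self, kind: str, pages: int) -> None:
        pool = self.kv if kind == "kv" else self.ssm
        pool.used -= pages


alloc = DualPoolAllocator(kv=Pool(8), ssm=Pool(8))
alloc.alloc("kv", 8)     # fits, no migration
alloc.alloc("kv", 2)     # would fail, so one batch moves from the SSM pool
print(alloc.migrations, alloc.kv.capacity, alloc.ssm.capacity)
```

Because the migration branch is reached only when `target.free < pages`, a request stream that never exhausts a pool never moves capacity, which is the lazy behavior the core claim describes.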

What carries the argument

Asymmetric Virtual Memory Paging (AVMP), an allocator that maintains physically separate KV and SSM pools under a unified virtual address space and migrates capacity only on allocation failure.
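
One way to picture a single virtual handle space over two physical pools is a tagged integer. The sketch below assumes a 32-bit handle split into one pool-tag bit and a 31-bit page index; that split, and the names `make_handle` and `resolve`, are hypothetical and are not taken from the paper.

```python
KV_POOL, SSM_POOL = 0, 1
INDEX_BITS = 31                       # assumed split: 1 tag bit + 31 index bits
INDEX_MASK = (1 << INDEX_BITS) - 1


def make_handle(pool: int, index: int) -> int:
    """Pack a pool tag and a page index into one 32-bit virtual handle."""
    if pool not in (KV_POOL, SSM_POOL):
        raise ValueError("unknown pool tag")
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("page index out of range")
    return (pool << INDEX_BITS) | index


def resolve(handle: int) -> tuple[int, int]:
    """Split a handle back into (pool tag, page index)."""
    return handle >> INDEX_BITS, handle & INDEX_MASK


handle = make_handle(SSM_POOL, 5)
print(resolve(handle))   # the caller sees one handle type; the tag picks the pool
```

Callers hold only the integer handle, so the physical pool behind it can be resized without changing the handle type they see.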

If this is right

  • Out-of-memory events fall by 7.6 percent across evaluated workloads.
  • Request throughput rises between 1.83x and 13.3x on synthetic workloads and 2.36x on ShareGPT traces.
  • Gains remain statistically significant under paired-bootstrap 95 percent confidence intervals.
  • Phase-time breakdowns separate the benefit into shorter OOM recovery on pressured workloads and faster allocation calls on KV-heavy workloads.
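
For readers unfamiliar with the statistic behind the third bullet, a paired-bootstrap percentile interval can be sketched as follows. The pairing unit (one baseline and one AVMP measurement per cell), the resample count, and the function name are assumptions; the source does not specify them.

```python
import random


def paired_bootstrap_ci(baseline, treated, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean paired difference (treated - baseline)."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty samples of equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample cells with replacement, keeping each pair intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-cell throughputs, for illustration only.
base = [10.0, 12.0, 9.0, 11.0, 10.0, 13.0]
avmp = [21.0, 25.0, 17.0, 24.0, 19.0, 27.0]
print(paired_bootstrap_ci(base, avmp))
```

A gain "holds" under this test when the whole interval lies above zero, meaning the improvement is not explained by which cells happened to be sampled.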

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unified-virtual-plus-physical-separation pattern could be applied to other model families that maintain caches with mismatched growth rates.
  • Because migration is lazy and deterministic, the technique may integrate cleanly into existing inference servers without requiring changes to scheduling logic.
  • Pure-Python implementation suggests the approach can be adopted quickly while Triton or CUDA kernels could later reduce migration cost further.

Load-bearing premise

Migration triggered solely on allocation failure will keep overall behavior deterministic and will not introduce hidden latency or correctness issues under realistic request interleaving.

What would settle it

A workload of rapidly interleaved requests with shifting cache-size ratios that shows either higher tail latency or non-deterministic outputs after migration would undercut the premise.

Figures

Figures reproduced from arXiv: 2605.22416 by An Xuan Nguyen.

Figure 1: AVMP virtual handle resolution. A 32-bit handle …

Figure 2: Pool rebalancing state machine. The alloca…

Figure 3: Cross-allocator OOM totals per workload (…

Figure 4: Wall-clock phase decomposition per (variant, workload) cell (…

Figure 6: 𝑁_OOM variance as a function of migration batch size 𝐵 ∈ {1, …, 256} (↓ lower is better). Solid lines plot AVMP per workload; dotted reference lines mark the fixed_dual_mr05 static baseline for the same workload. In Stage 1, we sweep the migration_batch_size parameter across 9 values ranging from 1 to 256. The results confirm our hypothesis that migration batch size acts as the dominant performance ax…

Figure 7: Stage 2 threshold sensitivity (↓ lower is better). Bars are total 𝑁_OOM across 12 cells × 3 workloads = 36 measurements for each of 4 threshold variants plus the b128 reference. All five bars land at 510.0, confirming the stage-2 null result: threshold tuning within the sampled ranges has no measurable effect on OOM count at fixed 𝐵 = 128. … future work on AVMP should focus on migration batch size rather th…
Original abstract

Hybrid language models like Jamba mix attention layers with State Space Models (SSMs), creating two memory cache types with opposite profiles: Key-Value (KV) caches grow linearly with sequence length, while SSM states stay fixed per layer. Current inference engines handle this poorly. Unified pools pad SSM states to attention page sizes, wasting up to 7.3x capacity. Static dual pools cannot adapt when prompt distributions shift between requests. We present Asymmetric Virtual Memory Paging (AVMP). The allocator separates the two cache types into physically distinct pools behind a unified virtual address space, and migrates capacity between pools when one runs out. Migration triggers only on allocation failure, keeping behavior deterministic. We evaluate AVMP across 270 synthetic cells plus 60 cells of ShareGPT trace replay on an RTX 3060 12GB. Out-of-Memory events drop 7.6% and request throughput improves 1.83x to 13.3x across synthetic workloads and 2.36x on ShareGPT. All gains hold under paired-bootstrap 95% confidence intervals. A phase-time breakdown reveals two distinct mechanisms: shorter OOM recovery on capacity-pressured workloads, and faster allocation calls on KV-heavy workloads. Implementation is pure Python; Triton integration is future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Asymmetric Virtual Memory Paging (AVMP) for hybrid Mamba-Transformer models such as Jamba. It separates KV caches (which grow with sequence length) and SSM states (fixed per layer) into distinct physical memory pools under a single virtual address space, performing capacity migration exclusively upon allocation failure to preserve determinism. On an RTX 3060 12GB GPU, AVMP reduces out-of-memory events by 7.6% and improves request throughput by 1.83x–13.3x on 270 synthetic workload cells and by 2.36x on 60 cells of ShareGPT trace replay, with all gains supported by paired-bootstrap 95% confidence intervals. A phase-time breakdown attributes gains to shorter OOM recovery and faster allocations; the implementation is described as pure Python.

Significance. If the empirical results hold under broader conditions, AVMP offers a practical approach to reducing memory fragmentation in hybrid architectures that mix linear-growing and fixed-size caches. The use of bootstrap confidence intervals on both synthetic and real traces strengthens the performance claims. However, the absence of allocator pseudocode, overhead breakdowns, or formal arguments for determinism limits immediate reproducibility and generalization beyond the reported hardware and workloads.

major comments (2)
  1. [Evaluation] Evaluation section: The central claim that migration triggered only on allocation failure keeps behavior deterministic is not supported by experiments that vary request arrival order, batch sizes, or interleaving patterns; the reported synthetic cells and ShareGPT replay use fixed traces, leaving open whether different migration paths or pool sizes can arise across runs that differ only in timing.
  2. [Abstract and Evaluation] Abstract and Evaluation: No pseudocode, allocator algorithm, or breakdown of migration overheads (e.g., copy costs, virtual address remapping) is provided, so the reported throughput gains rest on unreviewed implementation details whose contribution cannot be isolated from the unified virtual address space itself.
minor comments (2)
  1. [Abstract] The abstract states 'All gains hold under paired-bootstrap 95% confidence intervals' but does not specify the exact pairing method or number of resamples used.
  2. [Evaluation] Figure or table captions for the phase-time breakdown should explicitly label the two mechanisms (OOM recovery vs. allocation speed) to match the textual description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments on our manuscript. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the paper.

  1. Referee: [Evaluation] Evaluation section: The central claim that migration triggered only on allocation failure keeps behavior deterministic is not supported by experiments that vary request arrival order, batch sizes, or interleaving patterns; the reported synthetic cells and ShareGPT replay use fixed traces, leaving open whether different migration paths or pool sizes can arise across runs that differ only in timing.

    Authors: We agree that the current evaluation relies on fixed traces, which limits direct evidence for determinism across varying arrival orders. The design ensures determinism because migration is triggered exclusively on allocation failure and follows a fixed policy for selecting source pools and target capacities; the outcome depends only on the sequence of allocation sizes requested, not on wall-clock timing. Different interleavings may change when failures occur but not the final pool sizes for a given total demand. To address the concern directly, we will add a new subsection to the revised Evaluation with experiments that permute batch sizes and interleaving patterns drawn from the synthetic workload cells, measuring migration events and confirming that throughput gains remain consistent within the reported bootstrap intervals. revision: yes

  2. Referee: [Abstract and Evaluation] Abstract and Evaluation: No pseudocode, allocator algorithm, or breakdown of migration overheads (e.g., copy costs, virtual address remapping) is provided, so the reported throughput gains rest on unreviewed implementation details whose contribution cannot be isolated from the unified virtual address space itself.

    Authors: We acknowledge that the absence of pseudocode and explicit overhead measurements reduces immediate reproducibility. In the revised manuscript we will add an appendix containing pseudocode for the core allocator routines (virtual address mapping, failure-triggered migration, and pool resizing). We will also extend the phase-time breakdown with a table isolating migration costs (host-to-device copies and virtual remapping) from baseline allocation time on the RTX 3060. These additions will allow readers to separate the benefit of the unified address space from the migration mechanism itself while preserving the pure-Python implementation details already described. revision: yes
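
The kind of replay check the first exchange asks for can be sketched with a toy policy. Everything here is an assumption for illustration (the `replay` function, the page counts, the batch rule); it does not reproduce AVMP's policy and says nothing about its measured behavior. It shows only how one would compare migration logs across repeated and reordered traces.

```python
def replay(trace, kv_pages=8, ssm_pages=8, batch=4):
    """Replay (kind, pages) requests; return (migration log, capacities, OOM count)."""
    cap = {"kv": kv_pages, "ssm": ssm_pages}
    used = {"kv": 0, "ssm": 0}
    log, ooms = [], 0
    for step, (kind, pages) in enumerate(trace):
        donor = "ssm" if kind == "kv" else "kv"
        shortfall = pages - (cap[kind] - used[kind])
        if shortfall > 0:
            # Migration is attempted only because this allocation would fail.
            want = -(-shortfall // batch) * batch     # round up to whole batches
            moved = min(want, cap[donor] - used[donor])
            if moved < shortfall:
                ooms += 1
                continue
            cap[donor] -= moved
            cap[kind] += moved
            log.append((step, kind, moved))
        used[kind] += pages
    return log, cap, ooms


trace = [("kv", 6), ("ssm", 6), ("kv", 4), ("ssm", 4)]
reordered = [("ssm", 6), ("ssm", 4), ("kv", 6), ("kv", 4)]

# Same request sequence, same result: no clock or randomness is involved.
assert replay(trace) == replay(trace)

# Same total demand in a different order: compare logs and final pool sizes.
print(replay(trace))
print(replay(reordered))
```

In this toy policy the two orderings end with different pool sizes, so a check of this shape separates run-to-run determinism (same sequence, same outcome) from order-independence (same demand, same outcome), which are the two properties at issue in the exchange.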

Circularity Check

0 steps flagged

No significant circularity; empirical system design with direct measurements


The paper describes a practical memory allocator for hybrid models that separates KV and SSM caches into distinct physical pools with migration only on allocation failure. All reported improvements (OOM reduction, throughput gains) are presented as direct empirical results from hardware experiments on synthetic cells and ShareGPT traces, with no equations, fitted parameters, predictions derived from inputs, or self-citation chains that reduce the central claims to prior work by the same authors. The determinism claim is a stated design property evaluated experimentally rather than a derived quantity that loops back to itself. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper introduces no new mathematical axioms or invented physical entities. It relies on standard operating-system virtual-memory assumptions and on the empirical observation that prompt distributions shift between requests.

axioms (2)
  • [standard math] Standard virtual-memory hardware and OS paging primitives behave as described for unified address spaces.
    Invoked when the allocator presents separate physical pools through one virtual address space.
  • [domain assumption] Migration cost is incurred only on allocation failure and does not affect correctness or determinism under the tested workloads.
    Stated in the abstract as the trigger condition that keeps behavior deterministic.

