Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

An Xuan Nguyen

arXiv: 2605.22416 · v1 · pith:QCFYB4O7new · submitted 2026-05-21 · cs.LG · cs.DC · cs.PF

Pith reviewed 2026-05-22 07:23 UTC · model grok-4.3

classification cs.LG · cs.DC · cs.PF

keywords hybrid language models · virtual memory paging · KV cache · state space models · inference memory management · Mamba-Transformer · out-of-memory reduction · asymmetric pooling

The pith

Asymmetric virtual memory paging keeps KV and SSM caches in separate physical pools behind one virtual address space and migrates capacity only on allocation failure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Hybrid models mix attention layers that need growing KV caches with state space model layers that need fixed-size states. Unified memory pools waste space by padding the smaller states up to attention page sizes while static dual pools cannot shift capacity when request patterns change. AVMP presents both cache types through a single virtual address space but keeps them in physically distinct pools and moves capacity between pools only when an allocation would otherwise fail. This design cuts out-of-memory events and raises request throughput on both controlled synthetic loads and real ShareGPT traces. The gains appear through two mechanisms: quicker recovery after pressure and faster allocations when KV caches dominate.

Core claim

The allocator separates KV caches and SSM states into physically distinct pools that share one virtual address space; when either pool runs out, spare capacity is migrated from the other pool, but migration occurs only after an allocation failure so that overall behavior remains deterministic.

What carries the argument

Asymmetric Virtual Memory Paging (AVMP), an allocator that maintains physically separate KV and SSM pools under a unified virtual address space and migrates capacity only on allocation failure.
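The paper provides no pseudocode, so the following is only a minimal sketch of the idea as summarized above: two physical pools with different page sizes, one virtual handle table in front of them, and capacity migration that runs only when an allocation would otherwise fail. Every name here (`Pool`, `ToyAVMP`, `migration_batch_bytes`) and every size is a hypothetical chosen for illustration, not the author's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """One physical pool of fixed-size pages for a single cache type."""
    name: str
    page_bytes: int        # page size differs between KV and SSM pools
    capacity_bytes: int    # bytes currently owned by this pool
    used_pages: int = 0

    def free_bytes(self) -> int:
        return self.capacity_bytes - self.used_pages * self.page_bytes


@dataclass
class ToyAVMP:
    """Toy model: two physical pools behind one virtual handle table."""
    kv: Pool
    ssm: Pool
    migration_batch_bytes: int = 16384   # hypothetical default
    page_table: dict = field(default_factory=dict)  # vaddr -> (kind, n_pages)
    next_vaddr: int = 0
    migrations: int = 0

    def _pools(self, kind: str):
        # Returns (target pool, donor pool) for the requested cache type.
        return (self.kv, self.ssm) if kind == "kv" else (self.ssm, self.kv)

    def alloc(self, kind: str, n_pages: int) -> int:
        target, donor = self._pools(kind)
        need = n_pages * target.page_bytes
        if target.free_bytes() < need:
            # Migration is attempted only here, on allocation failure.
            shortfall = need - target.free_bytes()
            move = min(max(shortfall, self.migration_batch_bytes),
                       donor.free_bytes())
            if move < shortfall:
                raise MemoryError(f"OOM: {kind} needs {need} bytes")
            donor.capacity_bytes -= move
            target.capacity_bytes += move
            self.migrations += 1
        target.used_pages += n_pages
        vaddr = self.next_vaddr          # one address space for both types
        self.next_vaddr += 1
        self.page_table[vaddr] = (kind, n_pages)
        return vaddr

    def free(self, vaddr: int) -> None:
        kind, n_pages = self.page_table.pop(vaddr)
        target, _ = self._pools(kind)
        target.used_pages -= n_pages
```

In this sketch the caller sees only integer handles from a single counter, while capacity accounting stays per pool. A real allocator would also have to remap physical pages and handle alignment, which the toy model ignores.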

If this is right

Out-of-memory events fall by 7.6 percent across evaluated workloads.
Request throughput rises between 1.83x and 13.3x on synthetic workloads and 2.36x on ShareGPT traces.
Gains remain statistically significant under paired-bootstrap 95 percent confidence intervals.
Phase-time breakdowns separate the benefit into shorter OOM recovery on pressured workloads and faster allocation calls on KV-heavy workloads.
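The abstract does not state the pairing method or resample count behind the confidence intervals. As a rough illustration of what a paired bootstrap usually looks like, the sketch below resamples per-cell differences between two allocators and reports a percentile interval for the mean difference. The function name, the defaults, and the sample values are assumptions made up for this example, not figures from the paper.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, treatment, n_resamples=10_000,
                        alpha=0.05, seed=0):
    """Percentile CI for the mean paired difference (treatment - baseline)."""
    if len(baseline) != len(treatment):
        raise ValueError("paired samples must have equal length")
    # Pairing: each difference comes from the same workload cell.
    diffs = [t - b for b, t in zip(baseline, treatment)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi


# Hypothetical OOM counts per cell for a baseline and for the new allocator.
baseline_oom = [10, 12, 9, 11, 10, 13]
avmp_oom = [7, 9, 8, 8, 7, 10]
mean_diff, lo, hi = paired_bootstrap_ci(baseline_oom, avmp_oom)
```

If the whole interval lies below zero, the reduction would count as significant at the chosen level under this procedure.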

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same unified-virtual-plus-physical-separation pattern could be applied to other model families that maintain caches with mismatched growth rates.
Because migration is lazy and deterministic, the technique may integrate cleanly into existing inference servers without requiring changes to scheduling logic.
The pure-Python implementation suggests the approach can be adopted quickly, while Triton or CUDA kernels could later reduce migration cost further.

Load-bearing premise

Migration triggered solely on allocation failure will keep overall behavior deterministic and will not introduce hidden latency or correctness issues under realistic request interleaving.

What would settle it

A workload of rapidly interleaved requests with differing cache-size ratios that shows either higher tail latency or non-deterministic outputs after migration.
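One way to probe this premise is a replay harness: run the same request sequence twice and compare migration logs, then shuffle arrival order and record which final pool splits occur. The sketch below does this against a deliberately simplified counter model (no frees, no page-size asymmetry, rejected requests are skipped). The functions `replay` and `arrival_order_sensitivity` and all parameters are hypothetical, and the model is not the paper's allocator.

```python
import random


def replay(requests, kv_cap=8, ssm_cap=8, batch=2):
    """Replay (kind, units) requests through a toy two-pool model.

    Returns (log, cap): the log has one entry per migration or OOM,
    cap is the final capacity of each pool.
    """
    cap = {"kv": kv_cap, "ssm": ssm_cap}
    used = {"kv": 0, "ssm": 0}
    log = []
    for i, (kind, units) in enumerate(requests):
        donor = "ssm" if kind == "kv" else "kv"
        shortfall = units - (cap[kind] - used[kind])
        if shortfall > 0:  # migrate only when the allocation would fail
            move = min(max(shortfall, batch), cap[donor] - used[donor])
            if move < shortfall:
                log.append((i, "oom", 0))
                continue  # request rejected, state unchanged
            cap[donor] -= move
            cap[kind] += move
            log.append((i, donor, move))
        used[kind] += units
    return log, cap


def arrival_order_sensitivity(requests, trials=100, seed=0):
    """Collect the distinct final (kv, ssm) splits seen across shuffles."""
    rng = random.Random(seed)
    outcomes = set()
    for _ in range(trials):
        shuffled = requests[:]
        rng.shuffle(shuffled)
        _, cap = replay(shuffled)
        outcomes.add((cap["kv"], cap["ssm"]))
    return outcomes


requests = [("kv", 3)] * 3 + [("ssm", 2)] * 3
first = replay(requests)
second = replay(requests)
splits = arrival_order_sensitivity(requests)
```

Identical logs from `first` and `second` show only that the toy model is a pure function of the request sequence. Whether `splits` holds a single element for realistic traces is the open question, and a real test would also need to measure tail latency.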

Figures

Figures reproduced from arXiv: 2605.22416 by An Xuan Nguyen.

**Figure 2.** Pool rebalancing state machine. The alloca…

**Figure 3.** Cross-allocator OOM totals per workload (…

**Figure 4.** Wall-clock phase decomposition per (variant, workload) cell (…

**Figure 6.** 𝑁_OOM variance as a function of migration batch size 𝐵 ∈ {1, …, 256} (↓ lower is better). Solid lines plot AVMP per workload; dotted reference lines mark the fixed_dual_mr05 static baseline for the same workload. In Stage 1, we sweep the migration_batch_size parameter across 9 values ranging from 1 to 256. The results confirm our hypothesis that migration batch size acts as the dominant performance ax…

**Figure 7.** Stage 2 threshold sensitivity (↓ lower is better). Bars are total 𝑁_OOM across 12 cells × 3 workloads = 36 measurements for each of 4 threshold variants plus the b128 reference. All five bars land at 510.0, confirming the stage-2 null result: threshold tuning within the sampled ranges has no measurable effect on OOM count at fixed 𝐵 = 128. Future work on AVMP should focus on migration batch size rather th…

Original abstract

Hybrid language models like Jamba mix attention layers with State Space Models (SSMs), creating two memory cache types with opposite profiles: Key-Value (KV) caches grow linearly with sequence length, while SSM states stay fixed per layer. Current inference engines handle this poorly. Unified pools pad SSM states to attention page sizes, wasting up to 7.3x capacity. Static dual pools cannot adapt when prompt distributions shift between requests. We present Asymmetric Virtual Memory Paging (AVMP). The allocator separates the two cache types into physically distinct pools behind a unified virtual address space, and migrates capacity between pools when one runs out. Migration triggers only on allocation failure, keeping behavior deterministic. We evaluate AVMP across 270 synthetic cells plus 60 cells of ShareGPT trace replay on an RTX 3060 12GB. Out-of-Memory events drop 7.6% and request throughput improves 1.83x to 13.3x across synthetic workloads and 2.36x on ShareGPT. All gains hold under paired-bootstrap 95% confidence intervals. A phase-time breakdown reveals two distinct mechanisms: shorter OOM recovery on capacity-pressured workloads, and faster allocation calls on KV-heavy workloads. Implementation is pure Python; Triton integration is future work.
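The padding waste mentioned in the abstract comes from rounding a fixed-size SSM state up to whole attention pages. The sketch below shows that arithmetic with made-up sizes. The function name and the byte counts are assumptions for illustration and are not meant to reproduce the paper's 7.3x figure.

```python
import math


def padding_overhead(state_bytes: int, page_bytes: int) -> float:
    """Bytes reserved divided by bytes used when a state is rounded up to whole pages."""
    pages = math.ceil(state_bytes / page_bytes)
    return pages * page_bytes / state_bytes


# Hypothetical sizes: a 4 KiB SSM state stored in a pool of 16 KiB attention pages.
small_state = padding_overhead(4096, 16384)    # one page reserved for a quarter-page state
exact_fit = padding_overhead(16384, 16384)     # state exactly fills one page
```

With these sizes the small state reserves four times the bytes it uses, while the exact-fit case reserves none extra. A dedicated SSM pool whose page size matches the state size stays at the exact-fit ratio.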

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AVMP gives a practical allocator for hybrid model caches with reported speedups, but the determinism under interleaved requests looks like a real soft spot.

The main takeaway is that AVMP offers a straightforward virtual memory approach for hybrid attention and SSM models: it puts their caches in physically separate pools behind a single virtual address space and moves capacity only when an allocation fails. The experiments show meaningful reductions in OOM events and throughput gains on limited hardware. What is new here is the combination of distinct physical pools, unified virtual addressing, and a migration policy triggered solely by failure.

The paper does well with its broad evaluation across synthetic workloads and real traces, and with its use of bootstrap confidence intervals to back the claims. Breaking the time down into phases also clarifies the two ways it helps.

The soft spots: no pseudocode or allocator details are provided, which makes it hard to assess the exact overheads or reproduce the mechanism. The determinism argument also looks weak under interleaved requests, where timing differences could lead to different migration paths and to latency variability that the reported setups do not test.

This paper targets engineers and researchers working on inference optimization for hybrid language models. A reader dealing with GPU memory constraints for mixed cache types would find practical value in the results. It has enough novelty and evidence to merit a serious referee, though it would benefit from more implementation transparency. I recommend putting it through peer review rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces Asymmetric Virtual Memory Paging (AVMP) for hybrid Mamba-Transformer models such as Jamba. It separates KV caches (which grow with sequence length) and SSM states (fixed per layer) into distinct physical memory pools under a single virtual address space, performing capacity migration exclusively upon allocation failure to preserve determinism. On an RTX 3060 12GB GPU, AVMP reduces out-of-memory events by 7.6% and improves request throughput by 1.83x–13.3x on 270 synthetic workload cells and by 2.36x on 60 cells of ShareGPT trace replay, with all gains supported by paired-bootstrap 95% confidence intervals. A phase-time breakdown attributes gains to shorter OOM recovery and faster allocations; the implementation is described as pure Python.

Significance. If the empirical results hold under broader conditions, AVMP offers a practical approach to reducing memory fragmentation in hybrid architectures that mix linear-growing and fixed-size caches. The use of bootstrap confidence intervals on both synthetic and real traces strengthens the performance claims. However, the absence of allocator pseudocode, overhead breakdowns, or formal arguments for determinism limits immediate reproducibility and generalization beyond the reported hardware and workloads.

major comments (2)

[Evaluation] Evaluation section: The central claim that migration triggered only on allocation failure keeps behavior deterministic is not supported by experiments that vary request arrival order, batch sizes, or interleaving patterns; the reported synthetic cells and ShareGPT replay use fixed traces, leaving open whether different migration paths or pool sizes can arise across runs that differ only in timing.
[Abstract and Evaluation] Abstract and Evaluation: No pseudocode, allocator algorithm, or breakdown of migration overheads (e.g., copy costs, virtual address remapping) is provided, so the reported throughput gains rest on unreviewed implementation details whose contribution cannot be isolated from the unified virtual address space itself.

minor comments (2)

[Abstract] The abstract states 'All gains hold under paired-bootstrap 95% confidence intervals' but does not specify the exact pairing method or number of resamples used.
[Evaluation] Figure or table captions for the phase-time breakdown should explicitly label the two mechanisms (OOM recovery vs. allocation speed) to match the textual description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments on our manuscript. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the paper.

Referee: [Evaluation] Evaluation section: The central claim that migration triggered only on allocation failure keeps behavior deterministic is not supported by experiments that vary request arrival order, batch sizes, or interleaving patterns; the reported synthetic cells and ShareGPT replay use fixed traces, leaving open whether different migration paths or pool sizes can arise across runs that differ only in timing.

Authors: We agree that the current evaluation relies on fixed traces, which limits direct evidence for determinism across varying arrival orders. The design ensures determinism because migration is triggered exclusively on allocation failure and follows a fixed policy for selecting source pools and target capacities; the outcome depends only on the sequence of allocation sizes requested, not on wall-clock timing. Different interleavings may change when failures occur but not the final pool sizes for a given total demand. To address the concern directly, we will add a new subsection in the revised Evaluation with experiments that permute batch sizes and interleaving patterns drawn from the synthetic workload cells, measuring migration events and confirming that throughput gains remain consistent within the reported bootstrap intervals.

revision: yes

Referee: [Abstract and Evaluation] Abstract and Evaluation: No pseudocode, allocator algorithm, or breakdown of migration overheads (e.g., copy costs, virtual address remapping) is provided, so the reported throughput gains rest on unreviewed implementation details whose contribution cannot be isolated from the unified virtual address space itself.

Authors: We acknowledge that the absence of pseudocode and explicit overhead measurements reduces immediate reproducibility. In the revised manuscript we will add an appendix containing pseudocode for the core allocator routines (virtual address mapping, failure-triggered migration, and pool resizing). We will also extend the phase-time breakdown with a table isolating migration costs (host-to-device copies and virtual remapping) from baseline allocation time on the RTX 3060. These additions will allow readers to separate the benefit of the unified address space from the migration mechanism itself while preserving the pure-Python implementation details already described.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical system design with direct measurements

The paper describes a practical memory allocator for hybrid models that separates KV and SSM caches into distinct physical pools with migration only on allocation failure. All reported improvements (OOM reduction, throughput gains) are presented as direct empirical results from hardware experiments on synthetic cells and ShareGPT traces, with no equations, fitted parameters, predictions derived from inputs, or self-citation chains that reduce the central claims to prior work by the same authors. The determinism claim is a stated design property evaluated experimentally rather than a derived quantity that loops back to itself. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper introduces no new mathematical axioms or invented physical entities. It relies on standard operating-system virtual-memory assumptions and on the empirical observation that prompt distributions shift between requests.

axioms (2)

[standard math] Standard virtual-memory hardware and OS paging primitives behave as described for unified address spaces.
Invoked when the allocator presents separate physical pools through one virtual address space.
[domain assumption] Migration cost is incurred only on allocation failure and does not affect correctness or determinism under the tested workloads.
Stated in the abstract as the trigger condition that keeps behavior deterministic.

