pith. machine review for the scientific record.

arxiv: 2512.09427 · v5 · submitted 2025-12-10 · 💻 cs.AR · cs.AI

Recognition: no theorem link

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:50 UTC · model grok-4.3

classification 💻 cs.AR cs.AI
keywords LLM serving · memory allocation · KV-cache management · LPDDR accelerators · generation length prediction · adaptive bucketing · on-demand allocation

The pith

ODMA raises KV-cache utilization by up to 19.25% (absolute) on LPDDR accelerators by predicting generation lengths and dynamically adjusting allocation buckets, backed by a safety pool for mispredictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets memory-management bottlenecks when serving large language models on accelerators whose random-access bandwidth is too low for conventional paging. Static pre-allocation reserves space for worst-case lengths and leaves large regions unused, while fine-grained paging destroys bandwidth on LPDDR-class hardware. ODMA therefore predicts each request's output length, places it into an adaptively sized bucket whose boundaries are updated from running histograms, and falls back to a small safety pool when predictions miss. This combination keeps allocations contiguous and low-overhead even when request distributions drift or exhibit heavy tails. Experiments on the Alpaca and Google-NQ workloads, deployed on real Cambricon MLU370-X4 hardware, show both higher cache occupancy and 23-27% more tokens per second than static baselines.
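To make the predict-then-bucket step concrete, here is a minimal sketch of a bucket-level length classifier. The paper describes only a lightweight predictor over prompt features (see Figure 2); the features, model choice, and uncertainty score below are illustrative assumptions, not ODMA's implementation.

```python
# Minimal sketch of a predict-then-bucket step. ODMA's actual predictor
# (a lightweight encoder over prompt features; see Figure 2) is not
# published, so the features and model here are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prompt_features(prompt: str) -> np.ndarray:
    # Cheap, backend-stable signals of the kind the paper alludes to:
    # prompt length, question-ness, rough instruction-ness.
    return np.array([
        len(prompt.split()),                     # token-count proxy
        float("?" in prompt),                    # questions tend to run shorter
        float(prompt.lower().startswith(("write", "explain", "list"))),
    ])

def train_predictor(prompts, lengths, boundaries):
    # Discretize observed generation lengths into bucket indices using the
    # allocator's current boundaries, then fit a small classifier.
    X = np.stack([prompt_features(p) for p in prompts])
    y = np.digitize(lengths, boundaries)
    return LogisticRegression(max_iter=1000).fit(X, y)

def predict_bucket(model, prompt):
    # Returns a bucket tag and a crude analogue of the uncertainty score u.
    probs = model.predict_proba(prompt_features(prompt).reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return int(model.classes_[best]), 1.0 - probs[best]
```

A high uncertainty score could route a request one bucket up, or reserve safety-pool headroom for it, trading a little utilization for fewer overflows.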

Core claim

ODMA advances generation-length prediction by addressing two production failure modes: distribution drift that invalidates static bucket boundaries, and performance fragility under heavy-tailed request patterns. It integrates a lightweight length predictor with adaptive bucket partitioning whose boundaries are dynamically recalibrated via online histograms to maximize utilization, while a fallback safety pool ensures robustness against prediction errors. The resulting contiguous allocations are shown to increase KV-cache utilization by up to 19.25% absolute and throughput by 23-27% over static baselines on Cambricon MLU370-X4 accelerators running DeepSeek-R1-Distill-Qwen-7B.

What carries the argument

Adaptive bucket partitioning whose boundaries are recalibrated online from request-length histograms, backed by a lightweight predictor and a reserved safety pool for mispredictions.
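Read as pseudocode, the mechanism is compact. The sketch below (Python, with invented sizes and policies; the paper specifies none of these constants) shows the three pieces interacting: bucketed contiguous capacity, a reserved safety pool, and quantile-based boundary recalibration over a sliding window of observed lengths.

```python
import bisect
from collections import deque

class AdaptiveBucketAllocator:
    """Sketch of ODMA-style adaptive bucketing: contiguous blocks sized by
    bucket, boundaries recalibrated from a sliding histogram of observed
    generation lengths, and a reserved safety pool for mispredictions.
    All sizes and policies here are assumptions, not the paper's values."""

    def __init__(self, max_len=4096, n_buckets=8, window=10_000,
                 recalibrate_every=1_000, safety_pool_tokens=32_768):
        self.max_len = max_len
        self.n_buckets = n_buckets
        self.recent = deque(maxlen=window)       # sliding window of lengths
        self.recalibrate_every = recalibrate_every
        self.safety_pool = safety_pool_tokens    # reserved contiguous tokens
        self.seen = 0
        # Start from uniform boundaries; refined online as requests finish.
        self.boundaries = [max_len * (i + 1) // n_buckets
                           for i in range(n_buckets)]

    def capacity_for(self, predicted_len):
        """Contiguous KV-cache capacity (tokens) for a predicted length."""
        i = min(bisect.bisect_left(self.boundaries, predicted_len),
                self.n_buckets - 1)
        return self.boundaries[i]

    def on_overflow(self, extra_tokens):
        """Misprediction: try to extend the request from the safety pool."""
        if extra_tokens <= self.safety_pool:
            self.safety_pool -= extra_tokens
            return True
        return False                             # caller must requeue/preempt

    def observe(self, actual_len):
        """Record a finished request; periodically move bucket boundaries to
        evenly spaced quantiles of the recent length distribution."""
        self.recent.append(min(actual_len, self.max_len))
        self.seen += 1
        if (self.seen % self.recalibrate_every == 0
                and len(self.recent) >= self.n_buckets):
            xs = sorted(self.recent)
            self.boundaries = [xs[(len(xs) * (i + 1)) // self.n_buckets - 1]
                               for i in range(self.n_buckets - 1)] + [self.max_len]
```

The recalibration cost is one sort of the sliding window every `recalibrate_every` requests, negligible next to decode time; the real design question is how large the window should be relative to the rate of distribution drift.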

If this is right

  • KV-cache utilization rises by up to 19.25% absolute compared with static worst-case provisioning.
  • Serving throughput increases 23-27% on the same LPDDR hardware and model.
  • Prediction accuracy improves from 98.60% to 99.55% on Alpaca and from 82.68% to 93.36% on Google-NQ.
  • Contiguous allocation becomes feasible on accelerators that cannot tolerate the random-access cost of paging.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same predictor-plus-adaptive-bucket pattern could reduce memory waste for other variable-length GPU workloads beyond LLM decoding.
  • Deployments on LPDDR hardware could support larger batch sizes or longer contexts without adding HBM.
  • Integration with request schedulers might further reduce safety-pool pressure by grouping similar-length requests.
  • The technique is likely to remain useful as models grow if the histogram window is tuned to the rate of distribution change.

Load-bearing premise

Online histogram recalibration will keep bucket allocations contiguous and low-overhead, without exhausting the safety pool, even when real production request lengths drift or follow heavy tails.

What would settle it

A multi-day production request trace with clear distribution drift, on which KV-cache utilization and per-request allocation overhead are measured end to end; the claim would fall if these showed no net gain over a static baseline.
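Before collecting a multi-day hardware trace, the same question can be rehearsed in simulation. The sketch below reuses the AdaptiveBucketAllocator from above and invents a drifting, heavy-tailed length generator and a noisy stand-in predictor; every parameter is hypothetical, so it illustrates only the measurement, not the expected outcome.

```python
import math
import random

def drifting_lengths(n, seed=0):
    # Toy generator: a lognormal body whose mode drifts upward over time,
    # plus a 2% heavy tail of very long requests. Parameters are invented.
    rng = random.Random(seed)
    for t in range(n):
        mode = 200 + 300 * (t / n)                           # slow drift
        if rng.random() < 0.02:
            yield min(4096, int(rng.expovariate(1 / 2000)))  # tail request
        else:
            yield min(4096, int(rng.lognormvariate(math.log(mode), 0.6)))

alloc = AdaptiveBucketAllocator()
rng = random.Random(1)
allocated = used = safety_hits = hard_overflows = 0
for actual in drifting_lengths(50_000):
    predicted = int(actual * rng.uniform(0.8, 1.2))  # noisy stand-in predictor
    cap = alloc.capacity_for(predicted)
    extra = max(0, actual - cap)
    if extra:
        if alloc.on_overflow(extra):
            safety_hits += 1
            alloc.safety_pool += extra   # refund once the request completes;
                                         # concurrent batches would contend here
        else:
            hard_overflows += 1          # on hardware: requeue or preempt
    allocated += max(cap, actual)
    used += actual
    alloc.observe(actual)

print(f"utilization={used / allocated:.1%}  "
      f"safety_hit_rate={safety_hits / 50_000:.2%}  "
      f"hard_overflows={hard_overflows}")
```

The settling experiment replaces the toy generator with the production trace and adds wall-clock counters (histogram-update latency, recalibration frequency, safety-pool hit rate) alongside utilization.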

Figures

Figures reproduced from arXiv: 2512.09427 by Guoqiang Zou, Hao Zheng, Longxiang Yin, Wanyu Wang, Yinhe Han.

Figure 1: ODMA overview. User prompts are annotated by the Predictor and inserted into a Task Pool. The Scheduler groups tagged tasks into batches and sends them to the Runtime, which interacts with the Allocator. The Allocator (with Cluster Manager and per-device Memory Pools) allocates bucket-tagged contiguous blocks on LPDDR-class accelerators.

Figure 2: Generation-length prediction in ODMA. A lightweight encoder maps prompt features and request metadata to a length estimate L̂ and an uncertainty score u, which together determine the bucket tag used by the allocator. When available, it also incorporates contextual signals such as conversation depth; these signals are designed to be cheap to compute and to remain stable across model backends.

Figure 3: Prediction accuracy of ODMA vs. S3 [5]. Left: Alpaca [10]; right: Google-NQ [18].

Figure 4: Throughput (TPS) improvement of ODMA over a static pre-allocation baseline (Cambricon-vLLM). Left: Alpaca; right: Google-NQ.

Figure 5: Device-memory utilization with ODMA. Left: Alpaca; right: Google-NQ.
read the original abstract

Existing memory management techniques severely hinder efficient Large Language Model serving on accelerators constrained by poor random-access bandwidth. While static pre-allocation preserves memory contiguity, it incurs significant overhead due to worst-case provisioning. Conversely, fine-grained paging mitigates this overhead but relies on HBM's high random-access tolerance, making it unsuitable for LPDDR systems where non-sequential access rapidly degrades bandwidth. Furthermore, prior works typically assume static distributions and HBM characteristics, thereby failing to resolve the critical fragmentation and bandwidth constraints inherent to LPDDR hardware. We present ODMA, an on-demand memory allocation strategy tailored for random-access-constrained accelerators, such as the Cambricon MLU series. ODMA advances generation-length prediction by addressing two critical limitations in production workloads: (i) distribution drift that invalidates static bucket boundaries, and (ii) performance fragility under heavy-tailed request patterns. ODMA integrates a lightweight length predictor with adaptive bucket partitioning and a fallback safety pool. Bucket boundaries are dynamically recalibrated via online histograms to maximize utilization, while the safety pool ensures robustness against prediction errors. On Alpaca and Google-NQ benchmarks, ODMA improves S3's prediction accuracy from 98.60% to 99.55% and from 82.68% to 93.36%, respectively. Deployment with DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4 accelerators demonstrates that ODMA increases KV-cache utilization by up to 19.25% (absolute) and throughput (TPS) by 23-27% over static baselines, validating the efficacy of predictor-driven contiguous allocation for LPDDR-class devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ODMA, an on-demand memory allocation strategy for LLM serving on LPDDR-class accelerators such as Cambricon MLU. It combines a lightweight generation-length predictor with adaptive bucket partitioning (via online histogram recalibration of boundaries) and a fallback safety pool to handle prediction errors. The approach targets distribution drift and heavy-tailed request patterns that invalidate static allocations. On Alpaca and Google-NQ, ODMA raises S3 prediction accuracy from 98.60% to 99.55% and 82.68% to 93.36%, respectively. In deployment with DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4, it reports up to 19.25% absolute KV-cache utilization gain and 23-27% throughput (TPS) improvement over static baselines.

Significance. If the performance numbers hold after additional controls, the work addresses a practical gap: efficient contiguous allocation on bandwidth-constrained LPDDR accelerators where HBM-style paging is unsuitable. The online histogram adaptation for drift and the safety-pool robustness mechanism could influence memory managers for production LLM serving on edge or cost-sensitive hardware.

major comments (3)
  1. [Deployment results] Deployment results section: the headline claims of 19.25% KV-cache utilization and 23-27% TPS gains rest on unquantified safety-pool hit rates, histogram-update latency, and recalibration frequency under the actual request-length distribution of the DeepSeek-R1-Distill-Qwen-7B runs. Without these counters, it is impossible to verify that the safety pool remains a small fraction or that recalibration stays low-overhead on LPDDR.
  2. [Experimental evaluation] Experimental evaluation: no details are supplied on the number of runs, measurement variance, statistical significance, or exact implementation of the static baselines (including how worst-case provisioning was sized). This leaves the 23-27% TPS improvement only moderately supported.
  3. [Prediction and allocation] Prediction and allocation sections: the claim that online histogram recalibration keeps allocations contiguous and low-overhead under heavy-tailed drift is central to the contribution, yet no ablation or sensitivity data are provided showing how often bucket boundaries change or how this affects bandwidth on the target hardware.
minor comments (2)
  1. [Abstract] Abstract contains run-on sentences and missing spaces (e.g., 'contiguity,it incurs').
  2. [Throughout] Define all acronyms (TPS, KV-cache, S3) on first use and ensure consistent terminology throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will incorporate the suggested additions and clarifications into the revised manuscript to improve transparency and experimental rigor.

read point-by-point responses
  1. Referee: [Deployment results] Deployment results section: the headline claims of 19.25% KV-cache utilization and 23-27% TPS gains rest on unquantified safety-pool hit rates, histogram-update latency, and recalibration frequency under the actual request-length distribution of the DeepSeek-R1-Distill-Qwen-7B runs. Without these counters, it is impossible to verify that the safety pool remains a small fraction or that recalibration stays low-overhead on LPDDR.

    Authors: We agree that these metrics are important for verifying the overhead claims. In the revised manuscript we will add explicit measurements of safety-pool hit rates, histogram-update latency, and recalibration frequency collected from the DeepSeek-R1-Distill-Qwen-7B deployment runs on Cambricon MLU370-X4. These counters will be reported in the Deployment results section together with the existing utilization and throughput numbers. revision: yes

  2. Referee: [Experimental evaluation] Experimental evaluation: no details are supplied on the number of runs, measurement variance, statistical significance, or exact implementation of the static baselines (including how worst-case provisioning was sized). This leaves the 23-27% TPS improvement only moderately supported.

    Authors: We acknowledge the need for more complete experimental reporting. The revised Experimental evaluation section will specify the number of independent runs performed, include standard deviations or confidence intervals for the TPS measurements, discuss statistical significance, and provide precise details on how the static baselines were configured, including the sizing methodology for worst-case KV-cache provisioning. revision: yes

  3. Referee: [Prediction and allocation] Prediction and allocation sections: the claim that online histogram recalibration keeps allocations contiguous and low-overhead under heavy-tailed drift is central to the contribution, yet no ablation or sensitivity data are provided showing how often bucket boundaries change or how this affects bandwidth on the target hardware.

    Authors: We recognize that ablation and sensitivity results would strengthen the central claim. We will add new analysis in the Prediction and allocation sections that reports the observed frequency of bucket-boundary changes and quantifies the resulting impact on memory bandwidth and allocation contiguity under the heavy-tailed request-length distributions from the Alpaca and Google-NQ workloads. revision: yes
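Response 1 promises safety-pool hit rates, histogram-update latency, and recalibration frequency. As a concrete (and entirely hypothetical) reading of what such instrumentation looks like, a deployment could maintain counters along these lines; the names and logging scheme below are invented, not from the paper:

```python
import time
from dataclasses import dataclass, field

@dataclass
class OdmaCounters:
    # The three quantities the referee asks for; all names are invented.
    requests: int = 0
    safety_pool_hits: int = 0          # mispredictions served from the pool
    recalibrations: int = 0
    recalibration_ms: list = field(default_factory=list)

    def record_request(self, used_safety_pool: bool):
        self.requests += 1
        self.safety_pool_hits += int(used_safety_pool)

    def time_recalibration(self, recalibrate_fn):
        # Wraps the histogram/boundary update to capture its latency.
        start = time.perf_counter()
        recalibrate_fn()
        self.recalibrations += 1
        self.recalibration_ms.append(1000 * (time.perf_counter() - start))

    def summary(self):
        hit_rate = self.safety_pool_hits / max(1, self.requests)
        avg_ms = (sum(self.recalibration_ms)
                  / max(1, len(self.recalibration_ms)))
        return {"safety_pool_hit_rate": hit_rate,
                "recalibrations": self.recalibrations,
                "avg_recalibration_ms": avg_ms}
```

Reporting these alongside utilization and TPS would directly verify that the safety pool stays a small fraction and that recalibration remains low-overhead on LPDDR.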

Circularity Check

0 steps flagged

No circularity: results are direct hardware measurements

full rationale

The paper reports empirical gains in KV-cache utilization (up to 19.25%) and TPS (23-27%) from deployment measurements on Cambricon MLU370-X4 with DeepSeek-R1-Distill-Qwen-7B, plus accuracy lifts on Alpaca/Google-NQ benchmarks. No equations, fitted parameters, or predictions are shown that reduce by construction to the inputs; the online histogram recalibration and safety pool are described as mechanisms whose overhead is validated externally rather than assumed tautologically. No self-citation chains or ansatzes are load-bearing in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the empirical effectiveness of length prediction and dynamic partitioning rather than formal axioms or new physical entities; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Generation-length distributions in production workloads exhibit drift and heavy tails that static buckets cannot handle.
    Invoked to justify the need for online histogram recalibration.

pith-pipeline@v0.9.0 · 5616 in / 1196 out tokens · 41294 ms · 2026-05-16T23:50:54.180680+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 8 internal anchors

  1. [1] T. B. Brown, B. Mann, N. Ryder, et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, 2020, pp. 1877–1901.
  2. [2] L. Steiner, M. Jung, M. Huonker, and N. Wehn. Unveiling the Real Performance of LPDDR5 Memories. arXiv:2209.14756, 2022.
  3. [3] X. L. Dong, S. Moon, Y. E. Xu, et al. Towards Next-Generation Intelligent Assistants Leveraging LLM Techniques. In KDD, 2023.
  4. [4] R. Pope, S. Douglas, A. Chowdhery, et al. Efficiently Scaling Transformer Inference. In MLSys, 2023.
  5. [5] Y. Jin, C.-F. Wu, D. Brooks, et al. S3: Increasing GPU Utilization During Generative Inference for Higher Throughput. In NeurIPS, 2023.
  6. [6] Cambricon. MLU370-X4 Smart Accelerator Card (official product page). https://www.cambricon.com/, accessed 2025.
  7. [7] Z. Ye. FlashInfer. GitHub repository, 2024.
  8. [8] Z. Feng, D. Guo, D. Tang, et al. CodeBERT: A Pre-trained Model for Programming and Natural Languages. arXiv:2002.08155, 2020.
  9. [9] P. Liu, W. Yuan, J. Fu, et al. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in NLP. ACM Computing Surveys, 55(9):1–35, 2023.
  10. [10] R. Taori, I. Gulrajani, T. Zhang, et al. Stanford Alpaca: An Instruction-Following LLaMA Model. GitHub repository, 2023.
  11. [11] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML, 2001.
  12. [12] Z. Zhou, X. Ning, K. Hong, et al. A Survey on Efficient Inference for Large Language Models. arXiv:2404.14294, 2024.
  13. [13] A. Vaswani et al. Attention is All You Need. In NeurIPS, 2017.
  14. [14] W. Kwon, Z. Li, S. Zhuang, et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. In SOSP, 2023.
  15. [15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2018.
  16. [16] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE TPAMI, 35(8):1798–1828, 2013.
  17. [17] X. Miao, G. Oliaro, Z. Zhang, et al. Towards Efficient Generative LLM Serving: A Survey from Algorithms to Systems. arXiv:2312.15234, 2023.
  18. [18] T. Kwiatkowski, J. Palomaki, O. Redfield, et al. Natural Questions: A Benchmark for QA Research. TACL, 2019.
  19. [19] A. Agrawal, A. Panwar, J. Mohan, et al. Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv:2308.16369, 2023.
  20. [20] A. Agrawal, N. Kedia, A. Panwar, et al. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. arXiv:2403.02310, 2024.
  21. [21] Y. Kim. Convolutional Neural Networks for Sentence Classification. arXiv:1408.5882, 2014.
  22. [22] X. Ma and E. Hovy. End-to-End Sequence Labeling via Bi-directional LSTM-CNNs-CRF. arXiv:1603.01354, 2016.
  23. [23] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks. arXiv:1409.3215, 2014.
  24. [24] A. Radford et al. Improving Language Understanding by Generative Pre-Training. OpenAI technical report, 2018.
  25. [25] R. Bommasani, D. A. Hudson, E. Adeli, et al. On the Opportunities and Risks of Foundation Models. arXiv:2108.07258, 2021.
  26. [26] NVIDIA. FasterTransformer: Transformer Related Optimizations. GitHub repository.
  27. [27] ModelTC. LightLLM. GitHub repository, 2024.
  28. [28] B. Wu, Y. Zhong, Z. Zhang, et al. Fast Distributed Inference Serving for Large Language Models. arXiv:2305.05920, 2023.
  29. [29] Y. Sheng, S. Cao, D. Li, et al. Fairness in Serving Large Language Models. In OSDI, 2024.
  30. [30] P. Patel, E. Choukse, C. Zhang, et al. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In ISCA, 2024.
  31. [31] C. Hu, H. Huang, L. Xu, et al. Inference Without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. arXiv:2401.11181, 2024.
  32. [32] Y. Zhong, S. Liu, J. Chen, et al. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving. arXiv:2401.09670, 2024.
  33. [33] H. Oh, K. Kim, J. Kim, et al. ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference. In ASPLOS, 2024.
  34. [34] B. Sun, Z. Huang, H. Zhao, et al. Llumnix: Dynamic Scheduling for Large Language Model Serving. In OSDI, 2024.
  35. [35] X. Miao, C. Shi, J. Duan, et al. SpotServe: Serving Generative LLMs on Preemptible Instances. In ASPLOS, 2024.
  36. [36] B. Lin, C. Zhang, T. Peng, et al. Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache. arXiv:2401.02669, 2024.
  37. [37] B. Wu, S. Liu, Y. Zhong, et al. LoongServe: Efficiently Serving Long-Context LLMs with Elastic Sequence Parallelism. arXiv:2404.09526, 2024.
  38. [38] NVIDIA. NVIDIA Ampere Architecture In-Depth. Technical Blog, 2020.
  39. [39] H. Shen, H. Chang, B. Dong, et al. Efficient LLM Inference on CPUs. arXiv:2311.00502, 2023.
  40. [40] T. Dettmers, R. Svirschevski, V. Egiazarian, et al. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. In ICLR, 2024.
  41. [41] N. Jouppi, G. Kurian, S. Li, et al. TPU v4: An Optically Reconfigurable Supercomputer for ML with Hardware Support for Embeddings. In ISCA, 2023.
  42. [42] R. Lai, J. Shao, S. Feng, et al. Relax: Composable Abstractions for End-to-End Dynamic Machine Learning. arXiv:2311.02103, 2023.
  43. [43] K. K. W. Ng, H. M. Demoulin, and V. Liu. Paella: Low-Latency Model Serving with Software-Defined GPU Scheduling. In SOSP, 2023.
  44. [44] M. Han, H. Zhang, R. Chen, et al. Microsecond-Scale Preemption for Concurrent GPU-Accelerated DNN Inferences. In OSDI, 2022.
  45. [45] A. Ali, R. Pinciroli, F. Yan, et al. Batch: ML Inference Serving on Serverless Platforms with Adaptive Batching. In SC, 2020.
  46. [46] C. Zhang, M. Yu, W. Wang, et al. MArk: Cost-Effective, SLO-Aware ML Inference Serving. In USENIX ATC, 2019.
  47. [47] Q. Weng, W. Xiao, Y. Yu, et al. MLaaS in the Wild: Workload Analysis and Scheduling in Large Heterogeneous GPU Clusters. In NSDI, 2022.