ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
Pith reviewed 2026-05-16 23:50 UTC · model grok-4.3
The pith
ODMA raises KV-cache utilization by up to 19.25% absolute on LPDDR-class accelerators by predicting generation lengths and recalibrating allocation buckets online, with a safety pool absorbing mispredictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ODMA advances generation-length prediction by addressing two limitations of production workloads: distribution drift that invalidates static bucket boundaries, and performance fragility under heavy-tailed request patterns. It integrates a lightweight length predictor with adaptive bucket partitioning whose boundaries are dynamically recalibrated via online histograms to maximize utilization, while a fallback safety pool ensures robustness against prediction errors. The resulting contiguous allocations are shown to increase KV-cache utilization by up to 19.25% absolute and throughput by 23-27% over static baselines on Cambricon MLU370-X4 accelerators running DeepSeek-R1-Distill-Qwen-7B.
What carries the argument
Adaptive bucket partitioning whose boundaries are recalibrated online from request-length histograms, backed by a lightweight predictor and a reserved safety pool for mispredictions.
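The mechanism is compact enough to sketch. The following is a minimal, hypothetical Python illustration of the pattern described above, not the paper's implementation: a decayed online histogram of observed generation lengths, equal-frequency bucket boundaries recomputed from it, and a safety pool that absorbs mispredicted overflows. Bucket count, decay factor, and pool size are invented parameters.

```python
import bisect

class AdaptiveBucketAllocator:
    """Sketch of predictor-driven bucket allocation with online recalibration.

    Hypothetical parameters (not from the paper): 8 buckets, equal-frequency
    quantile boundaries, exponential decay on the histogram so old traffic is
    forgotten under drift, and a fixed-size safety pool measured in tokens.
    """

    def __init__(self, max_len=4096, n_buckets=8, decay=0.99, safety_tokens=65536):
        self.max_len = max_len
        self.n_buckets = n_buckets
        self.decay = decay
        self.hist = [0.0] * (max_len + 1)      # decayed counts per observed length
        self.boundaries = [max_len * (i + 1) // n_buckets for i in range(n_buckets)]
        self.safety_free = safety_tokens       # fallback pool for mispredictions

    def observe(self, true_len):
        """Record a completed request; per-request cost is a single increment."""
        self.hist[min(true_len, self.max_len)] += 1.0

    def recalibrate(self):
        """Recompute boundaries as equal-frequency quantiles of the decayed
        histogram, then apply the decay so the window tracks drift."""
        total = sum(self.hist)
        if total == 0:
            return
        step = total / self.n_buckets
        acc, target, bounds = 0.0, step, []
        for length, count in enumerate(self.hist):
            acc += count
            while acc >= target and len(bounds) < self.n_buckets - 1:
                bounds.append(length)
                target += step
        bounds.append(self.max_len)
        self.boundaries = bounds
        self.hist = [c * self.decay for c in self.hist]

    def allocate(self, predicted_len):
        """Return the contiguous KV-cache reservation (in tokens) for a request."""
        i = bisect.bisect_left(self.boundaries, min(predicted_len, self.max_len))
        return self.boundaries[min(i, len(self.boundaries) - 1)]

    def on_overflow(self, extra_tokens):
        """A request outgrew its bucket: draw from the safety pool if possible."""
        if self.safety_free >= extra_tokens:
            self.safety_free -= extra_tokens
            return True
        return False   # caller must preempt, requeue, or evict
```

A serving loop would call allocate() with the predictor's output, observe() once the true length is known, recalibrate() periodically, and fall back to on_overflow() when a request outgrows its reservation; the per-request cost is one histogram increment plus a binary search over the bucket boundaries.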
If this is right
- KV-cache utilization rises by up to 19.25% absolute compared with static worst-case provisioning.
- Serving throughput increases 23-27% on the same LPDDR hardware and model.
- Prediction accuracy improves from 98.60% to 99.55% on Alpaca and from 82.68% to 93.36% on Google-NQ.
- Contiguous allocation becomes feasible on accelerators that cannot tolerate the random-access cost of paging.
Where Pith is reading between the lines
- The same predictor-plus-adaptive-bucket pattern could reduce memory waste for other variable-length GPU workloads beyond LLM decoding.
- Deployments on LPDDR hardware could support larger batch sizes or longer contexts without adding HBM.
- Integration with request schedulers might further reduce safety-pool pressure by grouping similar-length requests.
- The technique is likely to remain useful as models grow if the histogram window is tuned to the rate of distribution change.
Load-bearing premise
Online histogram recalibration will keep bucket allocations contiguous and low-overhead, without exhausting the safety pool, even when real production request lengths drift or follow heavy tails.
What would settle it
A multi-day production request trace exhibiting clear distribution drift where KV-cache utilization and per-request allocation overhead are measured and show no net gain over a static baseline.
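To make the falsification criterion concrete, here is one hypothetical way such a trace comparison could be scored; the utilization definition (tokens actually written versus tokens consumed, with overflow still counted as occupied memory) and the interfaces are assumptions, not the paper's metric.

```python
def kv_cache_utilization(trace, reserve):
    """trace: iterable of (predicted_len, true_len) pairs from a production log.
    reserve: function mapping a predicted length to the tokens reserved for it.
    Returns tokens actually used divided by tokens consumed, the quantity a
    drifting multi-day trace would have to leave flat for ODMA to show no gain."""
    used = consumed = 0
    for predicted_len, true_len in trace:
        reservation = reserve(predicted_len)
        consumed += max(reservation, true_len)  # overflow still occupies memory
        used += true_len
    return used / consumed if consumed else 0.0

# Hypothetical comparison (MAX_CONTEXT and allocator are placeholders):
#   static_util = kv_cache_utilization(trace, lambda _: MAX_CONTEXT)
#   odma_util   = kv_cache_utilization(trace, allocator.allocate)
```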
Original abstract
Existing memory management techniques severely hinder efficient Large Language Model serving on accelerators constrained by poor random-access bandwidth. While static pre-allocation preserves memory contiguity, it incurs significant overhead due to worst-case provisioning. Conversely, fine-grained paging mitigates this overhead but relies on HBM's high random-access tolerance, making it unsuitable for LPDDR systems where non-sequential access rapidly degrades bandwidth. Furthermore, prior works typically assume static distributions and HBM characteristics, thereby failing to resolve the critical fragmentation and bandwidth constraints inherent to LPDDR hardware. We present ODMA, an on-demand memory allocation strategy tailored for random-access-constrained accelerators, such as the Cambricon MLU series. ODMA advances generation-length prediction by addressing two critical limitations in production workloads: (i) distribution drift that invalidates static bucket boundaries, and (ii) performance fragility under heavy-tailed request patterns. ODMA integrates a lightweight length predictor with adaptive bucket partitioning and a fallback safety pool. Bucket boundaries are dynamically recalibrated via online histograms to maximize utilization, while the safety pool ensures robustness against prediction errors. On Alpaca and Google-NQ benchmarks, ODMA improves S3's prediction accuracy from 98.60% to 99.55% and 82.68% to 93.36%, respectively. Deployment with DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4 accelerators demonstrates that ODMA increases KV-cache utilization by up to 19.25% (absolute) and throughput (TPS) by 23-27% over static baselines, validating the efficacy of predictor-driven contiguous allocation for LPDDR-class devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ODMA, an on-demand memory allocation strategy for LLM serving on LPDDR-class accelerators such as Cambricon MLU. It combines a lightweight generation-length predictor with adaptive bucket partitioning (via online histogram recalibration of boundaries) and a fallback safety pool to handle prediction errors. The approach targets distribution drift and heavy-tailed request patterns that invalidate static allocations. On Alpaca and Google-NQ, ODMA raises S3 prediction accuracy from 98.60% to 99.55% and 82.68% to 93.36%, respectively. In deployment with DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4, it reports up to 19.25% absolute KV-cache utilization gain and 23-27% throughput (TPS) improvement over static baselines.
Significance. If the performance numbers hold after additional controls, the work addresses a practical gap: efficient contiguous allocation on bandwidth-constrained LPDDR accelerators where HBM-style paging is unsuitable. The online histogram adaptation for drift and the safety-pool robustness mechanism could influence memory managers for production LLM serving on edge or cost-sensitive hardware.
major comments (3)
- [Deployment results] Deployment results section: the headline claims of 19.25% KV-cache utilization and 23-27% TPS gains rest on unquantified safety-pool hit rates, histogram-update latency, and recalibration frequency under the actual request-length distribution of the DeepSeek-R1-Distill-Qwen-7B runs. Without these counters (a hypothetical sketch follows this list), it is impossible to verify that the safety pool remains a small fraction of allocations or that recalibration stays low-overhead on LPDDR.
- [Experimental evaluation] Experimental evaluation: no details are supplied on the number of runs, measurement variance, statistical significance, or exact implementation of the static baselines (including how worst-case provisioning was sized). This leaves the 23-27% TPS improvement only moderately supported.
- [Prediction and allocation] Prediction and allocation sections: the claim that online histogram recalibration keeps allocations contiguous and low-overhead under heavy-tailed drift is central to the contribution, yet no ablation or sensitivity data are provided showing how often bucket boundaries change or how this affects bandwidth on the target hardware.
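For the first major comment, a hypothetical sketch of the requested counters is given below (all names are invented here, not the paper's instrumentation); it assumes an allocator object exposing an observe() call like the one sketched earlier.

```python
import time
from dataclasses import dataclass, field

@dataclass
class OdmaCounters:
    """Hypothetical counters for the overheads questioned above."""
    requests: int = 0
    safety_pool_hits: int = 0                 # requests that spilled into the safety pool
    recalibrations: int = 0
    histogram_update_ns: list = field(default_factory=list)

    def timed_observe(self, allocator, true_len):
        """Wrap the histogram update to measure its latency."""
        start = time.perf_counter_ns()
        allocator.observe(true_len)
        self.histogram_update_ns.append(time.perf_counter_ns() - start)
        self.requests += 1

    def report(self, window_seconds):
        """Summarize the three quantities the comment asks to see."""
        n_updates = max(len(self.histogram_update_ns), 1)
        return {
            "safety_pool_hit_rate": self.safety_pool_hits / max(self.requests, 1),
            "mean_histogram_update_us": sum(self.histogram_update_ns) / n_updates / 1e3,
            "recalibrations_per_hour": self.recalibrations * 3600.0 / window_seconds,
        }
```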
minor comments (2)
- [Abstract] Abstract contains run-on sentences and missing spaces (e.g., 'contiguity,it incurs').
- [Throughout] Define all acronyms (TPS, KV-cache, S3) on first use and ensure consistent terminology throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will incorporate the suggested additions and clarifications into the revised manuscript to improve transparency and experimental rigor.
Point-by-point responses
- Referee: [Deployment results] Deployment results section: the headline claims of 19.25% KV-cache utilization and 23-27% TPS gains rest on unquantified safety-pool hit rates, histogram-update latency, and recalibration frequency under the actual request-length distribution of the DeepSeek-R1-Distill-Qwen-7B runs. Without these counters, it is impossible to verify that the safety pool remains a small fraction or that recalibration stays low-overhead on LPDDR.
Authors: We agree that these metrics are important for verifying the overhead claims. In the revised manuscript we will add explicit measurements of safety-pool hit rates, histogram-update latency, and recalibration frequency collected from the DeepSeek-R1-Distill-Qwen-7B deployment runs on Cambricon MLU370-X4. These counters will be reported in the Deployment results section together with the existing utilization and throughput numbers. revision: yes
- Referee: [Experimental evaluation] Experimental evaluation: no details are supplied on the number of runs, measurement variance, statistical significance, or exact implementation of the static baselines (including how worst-case provisioning was sized). This leaves the 23-27% TPS improvement only moderately supported.
Authors: We acknowledge the need for more complete experimental reporting. The revised Experimental evaluation section will specify the number of independent runs performed, include standard deviations or confidence intervals for the TPS measurements, discuss statistical significance, and provide precise details on how the static baselines were configured, including the sizing methodology for worst-case KV-cache provisioning. revision: yes
- Referee: [Prediction and allocation] Prediction and allocation sections: the claim that online histogram recalibration keeps allocations contiguous and low-overhead under heavy-tailed drift is central to the contribution, yet no ablation or sensitivity data are provided showing how often bucket boundaries change or how this affects bandwidth on the target hardware.
Authors: We recognize that ablation and sensitivity results would strengthen the central claim. We will add new analysis in the Prediction and allocation sections that reports the observed frequency of bucket-boundary changes and quantifies the resulting impact on memory bandwidth and allocation contiguity under the heavy-tailed request-length distributions from the Alpaca and Google-NQ workloads. revision: yes
Circularity Check
No circularity: results are direct hardware measurements
full rationale
The paper reports empirical gains in KV-cache utilization (up to 19.25%) and TPS (23-27%) from deployment measurements on Cambricon MLU370-X4 with DeepSeek-R1-Distill-Qwen-7B, plus accuracy lifts on Alpaca/Google-NQ benchmarks. No equations, fitted parameters, or predictions are shown that reduce by construction to the inputs; the online histogram recalibration and safety pool are described as mechanisms whose overhead is validated externally rather than assumed tautologically. No self-citation chains or ansatzes are load-bearing in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Generation length distributions in production workloads exhibit drift and heavy tails that static buckets cannot handle.