pith. machine review for the scientific record.

arxiv: 2512.09427 · v5 · submitted 2025-12-10 · 💻 cs.AR · cs.AI

Recognition: no theorem link

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:50 UTC · model grok-4.3

classification 💻 cs.AR cs.AI
keywords LLM serving · memory allocation · KV-cache management · LPDDR accelerators · generation length prediction · adaptive bucketing · on-demand allocation

The pith

ODMA raises KV-cache utilization by up to 19.25% (absolute) on LPDDR accelerators by predicting generation lengths and dynamically adjusting allocation buckets, backed by a safety pool for mispredictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets memory-management bottlenecks when serving large language models on accelerators whose random-access bandwidth is too low for conventional paging. Static pre-allocation reserves space for worst-case lengths and leaves large regions unused, while fine-grained paging destroys bandwidth on LPDDR-class hardware. ODMA therefore predicts each request's output length, places it into an adaptively sized bucket whose boundaries are updated from running histograms, and falls back to a small safety pool when predictions miss. This combination keeps allocations contiguous and low-overhead even when request distributions drift or exhibit heavy tails. Experiments on the Alpaca and Google-NQ workloads, deployed on real Cambricon MLU370-X4 hardware, show both higher cache occupancy and 23-27% more tokens per second than static baselines.
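To make the predict-then-bucket step concrete, here is a minimal sketch of a bucket-level length classifier. The paper describes only a lightweight predictor over prompt features (see Figure 2); the features, model choice, and uncertainty score below are illustrative assumptions, not ODMA's implementation.

```python
# Minimal sketch of a predict-then-bucket step. ODMA's actual predictor
# (a lightweight encoder over prompt features; see Figure 2) is not
# published, so the features and model here are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prompt_features(prompt: str) -> np.ndarray:
    # Cheap, backend-stable signals of the kind the paper alludes to:
    # prompt length, question-ness, rough instruction-ness.
    return np.array([
        len(prompt.split()),                     # token-count proxy
        float("?" in prompt),                    # questions tend to run shorter
        float(prompt.lower().startswith(("write", "explain", "list"))),
    ])

def train_predictor(prompts, lengths, boundaries):
    # Discretize observed generation lengths into bucket indices using the
    # allocator's current boundaries, then fit a small classifier.
    X = np.stack([prompt_features(p) for p in prompts])
    y = np.digitize(lengths, boundaries)
    return LogisticRegression(max_iter=1000).fit(X, y)

def predict_bucket(model, prompt):
    # Returns a bucket tag and a crude analogue of the uncertainty score u.
    probs = model.predict_proba(prompt_features(prompt).reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return int(model.classes_[best]), 1.0 - probs[best]
```

A high uncertainty score could route a request one bucket up, or reserve safety-pool headroom for it, trading a little utilization for fewer overflows.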

Core claim

ODMA advances generation-length prediction by addressing two production failure modes: distribution drift that invalidates static bucket boundaries, and performance fragility under heavy-tailed request patterns. It integrates a lightweight length predictor with adaptive bucket partitioning whose boundaries are dynamically recalibrated via online histograms to maximize utilization, while a fallback safety pool ensures robustness against prediction errors. The resulting contiguous allocations are shown to increase KV-cache utilization by up to 19.25% absolute and throughput by 23-27% over static baselines on Cambricon MLU370-X4 accelerators running DeepSeek-R1-Distill-Qwen-7B.

What carries the argument

Adaptive bucket partitioning whose boundaries are recalibrated online from request-length histograms, backed by a lightweight predictor and a reserved safety pool for mispredictions.
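Read as pseudocode, the mechanism is compact. The sketch below (Python, with invented sizes and policies; the paper specifies none of these constants) shows the three pieces interacting: bucketed contiguous capacity, a reserved safety pool, and quantile-based boundary recalibration over a sliding window of observed lengths.

```python
import bisect
from collections import deque

class AdaptiveBucketAllocator:
    """Sketch of ODMA-style adaptive bucketing: contiguous blocks sized by
    bucket, boundaries recalibrated from a sliding histogram of observed
    generation lengths, and a reserved safety pool for mispredictions.
    All sizes and policies here are assumptions, not the paper's values."""

    def __init__(self, max_len=4096, n_buckets=8, window=10_000,
                 recalibrate_every=1_000, safety_pool_tokens=32_768):
        self.max_len = max_len
        self.n_buckets = n_buckets
        self.recent = deque(maxlen=window)       # sliding window of lengths
        self.recalibrate_every = recalibrate_every
        self.safety_pool = safety_pool_tokens    # reserved contiguous tokens
        self.seen = 0
        # Start from uniform boundaries; refined online as requests finish.
        self.boundaries = [max_len * (i + 1) // n_buckets
                           for i in range(n_buckets)]

    def capacity_for(self, predicted_len):
        """Contiguous KV-cache capacity (tokens) for a predicted length."""
        i = min(bisect.bisect_left(self.boundaries, predicted_len),
                self.n_buckets - 1)
        return self.boundaries[i]

    def on_overflow(self, extra_tokens):
        """Misprediction: try to extend the request from the safety pool."""
        if extra_tokens <= self.safety_pool:
            self.safety_pool -= extra_tokens
            return True
        return False                             # caller must requeue/preempt

    def observe(self, actual_len):
        """Record a finished request; periodically move bucket boundaries to
        evenly spaced quantiles of the recent length distribution."""
        self.recent.append(min(actual_len, self.max_len))
        self.seen += 1
        if (self.seen % self.recalibrate_every == 0
                and len(self.recent) >= self.n_buckets):
            xs = sorted(self.recent)
            self.boundaries = [xs[(len(xs) * (i + 1)) // self.n_buckets - 1]
                               for i in range(self.n_buckets - 1)] + [self.max_len]
```

The recalibration cost is one sort of the sliding window every `recalibrate_every` requests, negligible next to decode time; the real design question is how large the window should be relative to the rate of distribution drift.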

If this is right

  • KV-cache utilization rises by up to 19.25% absolute compared with static worst-case provisioning.
  • Serving throughput increases 23-27% on the same LPDDR hardware and model.
  • Prediction accuracy improves from 98.60% to 99.55% on Alpaca and from 82.68% to 93.36% on Google-NQ.
  • Contiguous allocation becomes feasible on accelerators that cannot tolerate the random-access cost of paging.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same predictor-plus-adaptive-bucket pattern could reduce memory waste for other variable-length GPU workloads beyond LLM decoding.
  • Deployments on LPDDR hardware could support larger batch sizes or longer contexts without adding HBM.
  • Integration with request schedulers might further reduce safety-pool pressure by grouping similar-length requests.
  • The technique is likely to remain useful as models grow if the histogram window is tuned to the rate of distribution change.

Load-bearing premise

Online histogram recalibration will keep bucket allocations contiguous and low-overhead, without exhausting the safety pool, even when real production request lengths drift or follow heavy tails.

What would settle it

A multi-day production request trace with clear distribution drift, on which KV-cache utilization and per-request allocation overhead are measured end to end; the claim would fall if these showed no net gain over a static baseline.
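Before collecting a multi-day hardware trace, the same question can be rehearsed in simulation. The sketch below reuses the AdaptiveBucketAllocator from above and invents a drifting, heavy-tailed length generator and a noisy stand-in predictor; every parameter is hypothetical, so it illustrates only the measurement, not the expected outcome.

```python
import math
import random

def drifting_lengths(n, seed=0):
    # Toy generator: a lognormal body whose mode drifts upward over time,
    # plus a 2% heavy tail of very long requests. Parameters are invented.
    rng = random.Random(seed)
    for t in range(n):
        mode = 200 + 300 * (t / n)                           # slow drift
        if rng.random() < 0.02:
            yield min(4096, int(rng.expovariate(1 / 2000)))  # tail request
        else:
            yield min(4096, int(rng.lognormvariate(math.log(mode), 0.6)))

alloc = AdaptiveBucketAllocator()
rng = random.Random(1)
allocated = used = safety_hits = hard_overflows = 0
for actual in drifting_lengths(50_000):
    predicted = int(actual * rng.uniform(0.8, 1.2))  # noisy stand-in predictor
    cap = alloc.capacity_for(predicted)
    extra = max(0, actual - cap)
    if extra:
        if alloc.on_overflow(extra):
            safety_hits += 1
            alloc.safety_pool += extra   # refund once the request completes;
                                         # concurrent batches would contend here
        else:
            hard_overflows += 1          # on hardware: requeue or preempt
    allocated += max(cap, actual)
    used += actual
    alloc.observe(actual)

print(f"utilization={used / allocated:.1%}  "
      f"safety_hit_rate={safety_hits / 50_000:.2%}  "
      f"hard_overflows={hard_overflows}")
```

The settling experiment replaces the toy generator with the production trace and adds wall-clock counters (histogram-update latency, recalibration frequency, safety-pool hit rate) alongside utilization.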

Figures

Figures reproduced from arXiv: 2512.09427 by Guoqiang Zou, Hao Zheng, Longxiang Yin, Wanyu Wang, Yinhe Han.

Figure 1: ODMA overview. User prompts are annotated by the Predictor and inserted into a Task Pool. The Scheduler groups tagged tasks into batches and sends them to the Runtime, which interacts with the Allocator. The Allocator (with Cluster Manager and per-device Memory Pools) allocates bucket-tagged contiguous blocks on LPDDR-class accelerators.

Figure 2: Generation-length prediction in ODMA. A lightweight encoder maps prompt features and request metadata to a length estimate L̂ and an uncertainty score u, which together determine the bucket tag used by the allocator. When available, it also incorporates contextual signals such as conversation depth; these signals are designed to be cheap to compute and to remain stable across model backends.

Figure 3: Prediction accuracy of ODMA vs. S3 [5]. Left: Alpaca [10]; right: Google-NQ [18].

Figure 4: Throughput (TPS) improvement of ODMA over a static pre-allocation baseline (Cambricon-vLLM). Left: Alpaca; right: Google-NQ.

Figure 5: Device-memory utilization with ODMA. Left: Alpaca; right: Google-NQ.
read the original abstract

Existing memory management techniques severely hinder efficient Large Language Model serving on accelerators constrained by poor random-access bandwidth. While static pre-allocation preserves memory contiguity, it incurs significant overhead due to worst-case provisioning. Conversely, fine-grained paging mitigates this overhead but relies on HBM's high random-access tolerance, making it unsuitable for LPDDR systems where non-sequential access rapidly degrades bandwidth. Furthermore, prior works typically assume static distributions and HBM characteristics, thereby failing to resolve the critical fragmentation and bandwidth constraints inherent to LPDDR hardware. We present ODMA, an on-demand memory allocation strategy tailored for random-access-constrained accelerators, such as the Cambricon MLU series. ODMA advances generation-length prediction by addressing two critical limitations in production workloads: (i) distribution drift that invalidates static bucket boundaries, and (ii) performance fragility under heavy-tailed request patterns. ODMA integrates a lightweight length predictor with adaptive bucket partitioning and a fallback safety pool. Bucket boundaries are dynamically recalibrated via online histograms to maximize utilization, while the safety pool ensures robustness against prediction errors. On Alpaca and Google-NQ benchmarks, ODMA improves S3's prediction accuracy from 98.60% to 99.55% and from 82.68% to 93.36%, respectively. Deployment with DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4 accelerators demonstrates that ODMA increases KV-cache utilization by up to 19.25% (absolute) and throughput (TPS) by 23-27% over static baselines, validating the efficacy of predictor-driven contiguous allocation for LPDDR-class devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ODMA, an on-demand memory allocation strategy for LLM serving on LPDDR-class accelerators such as Cambricon MLU. It combines a lightweight generation-length predictor with adaptive bucket partitioning (via online histogram recalibration of boundaries) and a fallback safety pool to handle prediction errors. The approach targets distribution drift and heavy-tailed request patterns that invalidate static allocations. On Alpaca and Google-NQ, ODMA raises S3 prediction accuracy from 98.60% to 99.55% and 82.68% to 93.36%, respectively. In deployment with DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4, it reports up to 19.25% absolute KV-cache utilization gain and 23-27% throughput (TPS) improvement over static baselines.

Significance. If the performance numbers hold after additional controls, the work addresses a practical gap: efficient contiguous allocation on bandwidth-constrained LPDDR accelerators where HBM-style paging is unsuitable. The online histogram adaptation for drift and the safety-pool robustness mechanism could influence memory managers for production LLM serving on edge or cost-sensitive hardware.

major comments (3)
  1. [Deployment results] Deployment results section: the headline claims of 19.25% KV-cache utilization and 23-27% TPS gains rest on unquantified safety-pool hit rates, histogram-update latency, and recalibration frequency under the actual request-length distribution of the DeepSeek-R1-Distill-Qwen-7B runs. Without these counters, it is impossible to verify that the safety pool remains a small fraction or that recalibration stays low-overhead on LPDDR.
  2. [Experimental evaluation] Experimental evaluation: no details are supplied on the number of runs, measurement variance, statistical significance, or exact implementation of the static baselines (including how worst-case provisioning was sized). This leaves the 23-27% TPS improvement only moderately supported.
  3. [Prediction and allocation] Prediction and allocation sections: the claim that online histogram recalibration keeps allocations contiguous and low-overhead under heavy-tailed drift is central to the contribution, yet no ablation or sensitivity data are provided showing how often bucket boundaries change or how this affects bandwidth on the target hardware.
minor comments (2)
  1. [Abstract] Abstract contains run-on sentences and missing spaces (e.g., 'contiguity,it incurs').
  2. [Throughout] Define all acronyms (TPS, KV-cache, S3) on first use and ensure consistent terminology throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will incorporate the suggested additions and clarifications into the revised manuscript to improve transparency and experimental rigor.

read point-by-point responses
  1. Referee: [Deployment results] Deployment results section: the headline claims of 19.25% KV-cache utilization and 23-27% TPS gains rest on unquantified safety-pool hit rates, histogram-update latency, and recalibration frequency under the actual request-length distribution of the DeepSeek-R1-Distill-Qwen-7B runs. Without these counters, it is impossible to verify that the safety pool remains a small fraction or that recalibration stays low-overhead on LPDDR.

    Authors: We agree that these metrics are important for verifying the overhead claims. In the revised manuscript we will add explicit measurements of safety-pool hit rates, histogram-update latency, and recalibration frequency collected from the DeepSeek-R1-Distill-Qwen-7B deployment runs on Cambricon MLU370-X4. These counters will be reported in the Deployment results section together with the existing utilization and throughput numbers. revision: yes

  2. Referee: [Experimental evaluation] Experimental evaluation: no details are supplied on the number of runs, measurement variance, statistical significance, or exact implementation of the static baselines (including how worst-case provisioning was sized). This leaves the 23-27% TPS improvement only moderately supported.

    Authors: We acknowledge the need for more complete experimental reporting. The revised Experimental evaluation section will specify the number of independent runs performed, include standard deviations or confidence intervals for the TPS measurements, discuss statistical significance, and provide precise details on how the static baselines were configured, including the sizing methodology for worst-case KV-cache provisioning. revision: yes

  3. Referee: [Prediction and allocation] Prediction and allocation sections: the claim that online histogram recalibration keeps allocations contiguous and low-overhead under heavy-tailed drift is central to the contribution, yet no ablation or sensitivity data are provided showing how often bucket boundaries change or how this affects bandwidth on the target hardware.

    Authors: We recognize that ablation and sensitivity results would strengthen the central claim. We will add new analysis in the Prediction and allocation sections that reports the observed frequency of bucket-boundary changes and quantifies the resulting impact on memory bandwidth and allocation contiguity under the heavy-tailed request-length distributions from the Alpaca and Google-NQ workloads. revision: yes
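Response 1 promises safety-pool hit rates, histogram-update latency, and recalibration frequency. As a concrete (and entirely hypothetical) reading of what such instrumentation looks like, a deployment could maintain counters along these lines; the names and logging scheme below are invented, not from the paper:

```python
import time
from dataclasses import dataclass, field

@dataclass
class OdmaCounters:
    # The three quantities the referee asks for; all names are invented.
    requests: int = 0
    safety_pool_hits: int = 0          # mispredictions served from the pool
    recalibrations: int = 0
    recalibration_ms: list = field(default_factory=list)

    def record_request(self, used_safety_pool: bool):
        self.requests += 1
        self.safety_pool_hits += int(used_safety_pool)

    def time_recalibration(self, recalibrate_fn):
        # Wraps the histogram/boundary update to capture its latency.
        start = time.perf_counter()
        recalibrate_fn()
        self.recalibrations += 1
        self.recalibration_ms.append(1000 * (time.perf_counter() - start))

    def summary(self):
        hit_rate = self.safety_pool_hits / max(1, self.requests)
        avg_ms = (sum(self.recalibration_ms)
                  / max(1, len(self.recalibration_ms)))
        return {"safety_pool_hit_rate": hit_rate,
                "recalibrations": self.recalibrations,
                "avg_recalibration_ms": avg_ms}
```

Reporting these alongside utilization and TPS would directly verify that the safety pool stays a small fraction and that recalibration remains low-overhead on LPDDR.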

Circularity Check

0 steps flagged

No circularity: results are direct hardware measurements

full rationale

The paper reports empirical gains in KV-cache utilization (up to 19.25%) and TPS (23-27%) from deployment measurements on Cambricon MLU370-X4 with DeepSeek-R1-Distill-Qwen-7B, plus accuracy lifts on Alpaca/Google-NQ benchmarks. No equations, fitted parameters, or predictions are shown that reduce by construction to the inputs; the online histogram recalibration and safety pool are described as mechanisms whose overhead is validated externally rather than assumed tautologically. No self-citation chains or ansatzes are load-bearing in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the empirical effectiveness of length prediction and dynamic partitioning rather than formal axioms or new physical entities; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Generation-length distributions in production workloads exhibit drift and heavy tails that static buckets cannot handle.
    Invoked to justify the need for online histogram recalibration.

pith-pipeline@v0.9.0 · 5616 in / 1196 out tokens · 41294 ms · 2026-05-16T23:50:54.180680+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 8 internal anchors

  1. [1] T. B. Brown, B. Mann, N. Ryder, et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, 2020, pp. 1877–1901.
  2. [2] L. Steiner, M. Jung, M. Huonker, and N. Wehn. Unveiling the Real Performance of LPDDR5 Memories. arXiv:2209.14756, 2022.
  3. [3] X. L. Dong, S. Moon, Y. E. Xu, et al. Towards Next-Generation Intelligent Assistants Leveraging LLM Techniques. In KDD, 2023.
  4. [4] R. Pope, S. Douglas, A. Chowdhery, et al. Efficiently Scaling Transformer Inference. In MLSys, 2023.
  5. [5] Y. Jin, C.-F. Wu, D. Brooks, et al. S3: Increasing GPU Utilization During Generative Inference for Higher Throughput. In NeurIPS, 2023.
  6. [6] Cambricon. MLU370-X4 Smart Accelerator Card (official product page). https://www.cambricon.com/, accessed 2025.
  7. [7] Z. Ye. FlashInfer. GitHub repository, 2024.
  8. [8] Z. Feng, D. Guo, D. Tang, et al. CodeBERT: A Pre-trained Model for Programming and Natural Languages. arXiv:2002.08155, 2020.
  9. [9] P. Liu, W. Yuan, J. Fu, et al. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in NLP. ACM Computing Surveys, 55(9):1–35, 2023.
  10. [10] R. Taori, I. Gulrajani, T. Zhang, et al. Stanford Alpaca: An Instruction-Following LLaMA Model. GitHub repository, 2023.
  11. [11] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML, 2001.
  12. [12] Z. Zhou, X. Ning, K. Hong, et al. A Survey on Efficient Inference for Large Language Models. arXiv:2404.14294, 2024.
  13. [13] A. Vaswani et al. Attention is All You Need. In NeurIPS, 2017.
  14. [14] W. Kwon, Z. Li, S. Zhuang, et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. In SOSP, 2023.
  15. [15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2018.
  16. [16] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE TPAMI, 35(8):1798–1828, 2013.
  17. [17] X. Miao, G. Oliaro, Z. Zhang, et al. Towards Efficient Generative LLM Serving: A Survey from Algorithms to Systems. arXiv:2312.15234, 2023.
  18. [18] T. Kwiatkowski, J. Palomaki, O. Redfield, et al. Natural Questions: A Benchmark for QA Research. TACL, 2019.
  19. [19] A. Agrawal, A. Panwar, J. Mohan, et al. Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv:2308.16369, 2023.
  20. [20] A. Agrawal, N. Kedia, A. Panwar, et al. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. arXiv:2403.02310, 2024.
  21. [21] Y. Kim. Convolutional Neural Networks for Sentence Classification. arXiv:1408.5882, 2014.
  22. [22] X. Ma and E. Hovy. End-to-End Sequence Labeling via Bi-directional LSTM-CNNs-CRF. arXiv:1603.01354, 2016.
  23. [23] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks. arXiv:1409.3215, 2014.
  24. [24] A. Radford et al. Improving Language Understanding by Generative Pre-Training. OpenAI technical report, 2018.
  25. [25] R. Bommasani, D. A. Hudson, E. Adeli, et al. On the Opportunities and Risks of Foundation Models. arXiv:2108.07258, 2021.
  26. [26] NVIDIA. FasterTransformer: Transformer Related Optimizations. GitHub repository.
  27. [27] ModelTC. LightLLM. GitHub repository, 2024.
  28. [28] B. Wu, Y. Zhong, Z. Zhang, et al. Fast Distributed Inference Serving for Large Language Models. arXiv:2305.05920, 2023.
  29. [29] Y. Sheng, S. Cao, D. Li, et al. Fairness in Serving Large Language Models. In OSDI, 2024.
  30. [30] P. Patel, E. Choukse, C. Zhang, et al. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In ISCA, 2024.
  31. [31] C. Hu, H. Huang, L. Xu, et al. Inference Without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. arXiv:2401.11181, 2024.
  32. [32] Y. Zhong, S. Liu, J. Chen, et al. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving. arXiv:2401.09670, 2024.
  33. [33] H. Oh, K. Kim, J. Kim, et al. ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference. In ASPLOS, 2024.
  34. [34] B. Sun, Z. Huang, H. Zhao, et al. Llumnix: Dynamic Scheduling for Large Language Model Serving. In OSDI, 2024.
  35. [35] X. Miao, C. Shi, J. Duan, et al. SpotServe: Serving Generative LLMs on Preemptible Instances. In ASPLOS, 2024.
  36. [36] B. Lin, C. Zhang, T. Peng, et al. Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache. arXiv:2401.02669, 2024.
  37. [37] B. Wu, S. Liu, Y. Zhong, et al. LoongServe: Efficiently Serving Long-Context LLMs with Elastic Sequence Parallelism. arXiv:2404.09526, 2024.
  38. [38] NVIDIA. NVIDIA Ampere Architecture In-Depth. Technical Blog, 2020.
  39. [39] H. Shen, H. Chang, B. Dong, et al. Efficient LLM Inference on CPUs. arXiv:2311.00502, 2023.
  40. [40] T. Dettmers, R. Svirschevski, V. Egiazarian, et al. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. In ICLR, 2024.
  41. [41] N. Jouppi, G. Kurian, S. Li, et al. TPU v4: An Optically Reconfigurable Supercomputer for ML with Hardware Support for Embeddings. In ISCA, 2023.
  42. [42] R. Lai, J. Shao, S. Feng, et al. Relax: Composable Abstractions for End-to-End Dynamic Machine Learning. arXiv:2311.02103, 2023.
  43. [43] K. K. W. Ng, H. M. Demoulin, and V. Liu. Paella: Low-Latency Model Serving with Software-Defined GPU Scheduling. In SOSP, 2023.
  44. [44] M. Han, H. Zhang, R. Chen, et al. Microsecond-Scale Preemption for Concurrent GPU-Accelerated DNN Inferences. In OSDI, 2022.
  45. [45] A. Ali, R. Pinciroli, F. Yan, et al. Batch: ML Inference Serving on Serverless Platforms with Adaptive Batching. In SC, 2020.
  46. [46] C. Zhang, M. Yu, W. Wang, et al. MArk: Cost-Effective, SLO-Aware ML Inference Serving. In USENIX ATC, 2019.
  47. [47] Q. Weng, W. Xiao, Y. Yu, et al. MLaaS in the Wild: Workload Analysis and Scheduling in Large Heterogeneous GPU Clusters. In NSDI, 2022.