Technology solutions targeting the performance of gen-AI inference in resource constrained platforms

Aakash Patel; Dwaipayan Biswas; Joshua Klein; Joyjit Kundu

arxiv: 2604.11128 · v1 · submitted 2026-04-13 · 💻 cs.AR

Joyjit Kundu, Joshua Klein, Aakash Patel, Dwaipayan Biswas

Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3

keywords: generative AI, inference performance, high bandwidth storage, memory chiplet, roofline model, resource constrained platforms, large language models, memory bandwidth

The pith

High Bandwidth Storage can deliver interactive throughput for large language models on mobiles while bonded chiplets optimize smaller ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates technology solutions to ease the intense memory capacity and bandwidth demands that generative AI inference places on resource-constrained platforms such as mobiles. Long context lengths in tasks like question answering require caching massive key-value states, and even modest concurrent serving adds complexity. The authors apply a hierarchical roofline model to assess High Bandwidth Storage for 13B parameter models, specifying the bandwidth and latency needed for acceptable interactive performance, and examine a bonded global buffer memory chiplet for 1B parameter models along with usage recommendations. A sympathetic reader would care because these insights point to practical hardware paths that could make advanced AI features viable on everyday devices without excessive power or cost penalties.

Core claim

Using a hierarchical roofline-based analytical performance model, the paper evaluates the performance implications of High Bandwidth Storage for large models with 13B parameters and extended contexts, defining the bandwidth and latency requirements to reach acceptable interactivity throughput, and assesses the value of a bonded global buffer memory chiplet for small 1B parameter models while suggesting optimal ways to employ it.

What carries the argument

A hierarchical roofline-based analytical performance model, used to quantify how High Bandwidth Storage and bonded global buffer chiplets affect AI inference performance.
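To make the idea concrete, here is a minimal sketch of a hierarchical roofline bound. It is not the authors' model: the `MemoryLevel` class, the `attainable_gflops` function, the three-level hierarchy, and every number are hypothetical and chosen only for illustration. The sketch assumes that attainable performance is the smallest of the compute peak and each memory level's bandwidth multiplied by the arithmetic intensity seen at that level.

```python
from dataclasses import dataclass


@dataclass
class MemoryLevel:
    name: str
    bandwidth_gbps: float  # sustained bandwidth in GB/s (hypothetical value)


def attainable_gflops(peak_gflops, levels, flops, bytes_per_level):
    """Return (bound in GFLOP/s, name of the limiting resource).

    Each memory level imposes a ceiling of bandwidth * (FLOPs / bytes moved
    at that level); the compute peak is the ceiling of last resort.
    """
    bound, limiter = peak_gflops, "compute"
    for level in levels:
        traffic = bytes_per_level.get(level.name, 0)
        if traffic == 0:
            continue  # this level is not touched by the kernel
        intensity = flops / traffic            # FLOP per byte at this level
        ceiling = level.bandwidth_gbps * intensity  # GB/s * FLOP/B = GFLOP/s
        if ceiling < bound:
            bound, limiter = ceiling, level.name
    return bound, limiter


# Hypothetical three-level hierarchy and kernel traffic (illustration only).
levels = [
    MemoryLevel("global_buffer", 500.0),
    MemoryLevel("dram", 50.0),
    MemoryLevel("storage", 5.0),
]
traffic = {"global_buffer": 4e9, "dram": 1e9, "storage": 0.5e9}
bound, limiter = attainable_gflops(1000.0, levels, flops=2e9, bytes_per_level=traffic)
print(f"bound = {bound} GFLOP/s, limited by {limiter}")
```

In this toy setup the slowest tier sets the bound even though it carries the least traffic, which is the kind of effect a hierarchical roofline is meant to expose.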

If this is right

For large models, specific bandwidth and latency targets for High Bandwidth Storage are required to support interactive use with long contexts.
Bonded global buffer memory chiplets offer performance benefits for smaller models when utilized appropriately.
These solutions can reduce on/off-chip memory pressure, allowing concurrent inference serving on constrained devices.
Applications involving multimodal inputs and long-document analysis become more feasible on mobiles.
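The link between a throughput target and a bandwidth/latency requirement can be sketched with simple arithmetic. The snippet below is an illustration under stated assumptions, not the paper's model: it assumes decode is memory-bound, that a fixed number of bytes must be fetched from the storage tier for every generated token, and that access latency is fully exposed rather than overlapped. The function names, the 8-bit weight assumption, the 10 tokens/s target, and the access count are all hypothetical.

```python
def decode_tps(bytes_per_token, bandwidth_bytes_per_s,
               access_latency_s=0.0, accesses_per_token=1):
    """Tokens per second when each token needs a fixed amount of data."""
    transfer_time = bytes_per_token / bandwidth_bytes_per_s
    exposed_latency = access_latency_s * accesses_per_token
    return 1.0 / (transfer_time + exposed_latency)


def required_bandwidth(bytes_per_token, target_tps,
                       access_latency_s=0.0, accesses_per_token=1):
    """Bandwidth (bytes/s) needed to hit target_tps under the same model."""
    budget = 1.0 / target_tps - access_latency_s * accesses_per_token
    if budget <= 0:
        raise ValueError("latency alone exceeds the per-token time budget")
    return bytes_per_token / budget


# Assumption: 13e9 parameters stored at 1 byte each, all streamed per token.
weights_bytes = 13e9
target = 10.0  # hypothetical interactivity target in tokens/s

bw_no_latency = required_bandwidth(weights_bytes, target)
bw_with_latency = required_bandwidth(
    weights_bytes, target, access_latency_s=100e-6, accesses_per_token=40
)
print(f"{bw_no_latency / 1e9:.1f} GB/s without latency")
print(f"{bw_with_latency / 1e9:.1f} GB/s with exposed latency")
```

Under these assumptions, any exposed latency shrinks the per-token time budget, so the bandwidth needed for the same throughput rises. That is why bandwidth and latency targets have to be stated together.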

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hardware architects might integrate High Bandwidth Storage more aggressively in mobile SoCs to expand supported model sizes.
The analytical approach could be adapted to evaluate other emerging memory technologies for AI workloads.
Real-world testing on prototypes would be needed to confirm the predicted throughput gains before widespread adoption.

Load-bearing premise

The hierarchical roofline-based analytical performance model accurately predicts real hardware behavior and performance implications for these emerging technologies without requiring empirical validation or detailed cycle-accurate simulation.

What would settle it

Measurements from a hardware prototype or cycle-accurate simulator that show actual inference throughput or latency for 13B models with HBS differing substantially from the model's predictions.

Figures

Figures reproduced from arXiv: 2604.11128 by Aakash Patel, Dwaipayan Biswas, Joshua Klein, Joyjit Kundu.

**Figure 2.** (a) TPS as a function of HBS bandwidth when latency …

**Figure 3.** Achieved TPS for the three different configurations of …

**Figure 4.** Performance implications of QKV chiplet for …

Original abstract

The rise of generative AI workloads, particularly language model inference, is intensifying on/off-chip memory pressure. Multimodal inputs such as video streams or images and downstream applications like Question Answering (QA) and analysis over large documents incur long context lengths, requiring caching of massive Key and Value states of the previous tokens. Even a low degree of concurrent inference serving on resource-constrained devices, like mobiles, can further add to memory capacity pressure and runtime memory management complexity. In this paper, we evaluate the performance implications of two emerging technology solutions to alleviate the memory pressure in terms of both capacity and bandwidth using a hierarchical roofline-based analytical performance model. For large models (e.g., 13B parameters) and context lengths, we investigate the performance implications of High Bandwidth Storage (HBS) and outline bandwidth/latency requirements to achieve an acceptable throughput for interactivity. For small models (e.g., 1B parameters), we evaluate the merit of a bonded global buffer memory chiplet and propose how to best utilize it.
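The capacity pressure described in the abstract can be estimated with the standard key-value cache size formula. The sketch below is a rough illustration only: the layer count, head count, head dimension, context length, and 16-bit element size are assumptions loosely resembling a 13B-class dense transformer without grouped-query attention, and none of them are taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache K and V for every layer and every past token."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem * batch


GIB = 2 ** 30

# Hypothetical configuration (assumption, not from the paper).
single = kv_cache_bytes(n_layers=40, n_kv_heads=40, head_dim=128, context_len=32768)
concurrent = kv_cache_bytes(n_layers=40, n_kv_heads=40, head_dim=128,
                            context_len=32768, batch=4)

print(f"one request:   {single / GIB:.1f} GiB")
print(f"four requests: {concurrent / GIB:.1f} GiB")
```

With these assumed values a single long-context request needs tens of GiB of cache, and the total grows linearly with the number of concurrent requests. This is consistent with the abstract's point that even a low degree of concurrency adds to memory capacity pressure.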

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies a hierarchical roofline model to HBS and chiplet memory for LLM inference on edge devices but offers no validation of the model's accuracy for these workloads.

This paper looks at memory pressure from long-context LLM inference on constrained platforms and uses a hierarchical roofline model to assess two emerging fixes: High Bandwidth Storage for 13B-scale models and bonded global buffer chiplets for 1B-scale ones. It sketches bandwidth and latency targets for interactive throughput and suggests how to map the small-model case onto the extra buffer. That framing is straightforward and ties directly to real constraints like KV-cache size and concurrent serving.

The practical angle on memory hierarchy choices for edge AI hardware is the part that could be useful to architects working in that niche. The model itself is the standard roofline approach extended to a hierarchy, applied to these specific technologies rather than introducing a new analysis method. The authors correctly flag the capacity and bandwidth issues that come with multimodal inputs and long contexts.

The main limitation is that the performance numbers and utilization proposals rest entirely on the analytical model with no hardware measurements, cycle-accurate simulations, or even sensitivity checks against real attention access patterns. Roofline models tend to be optimistic for latency-sensitive, irregular traffic like KV-cache reads under concurrency, so the outlined requirements could shift once contention or actual device characteristics are measured. No error bars or validation steps appear in the available description.

This is the sort of evaluation study that computer-architecture groups focused on edge AI might want to see, mainly for the workload-specific questions it raises rather than for any new result. It does not resolve open questions in the field but could serve as a starting point for hardware designers if the model limitations are addressed. I would send it to peer review so referees can ask for validation data or a clearer statement of where the analytical predictions are likely to diverge from silicon.

Referee Report

1 major / 1 minor

Summary. The manuscript evaluates two emerging memory technologies for reducing on/off-chip memory pressure during generative AI inference on resource-constrained platforms. Using a hierarchical roofline-based analytical performance model, it examines High Bandwidth Storage (HBS) for large models (e.g., 13B parameters) with long contexts, deriving bandwidth and latency requirements needed for acceptable interactive throughput under concurrent serving. For small models (e.g., 1B parameters), it assesses a bonded global buffer memory chiplet and proposes utilization strategies to handle KV-cache demands from multimodal inputs and applications like QA over long documents.

Significance. If the hierarchical roofline model proves accurate for the described workloads, the paper offers timely, practical guidance on hardware specifications that could enable efficient on-device deployment of LLMs, addressing a pressing challenge in edge AI. The dual focus on large-model HBS requirements and small-model chiplet optimization provides a useful framework for technology roadmapping. The analytical approach allows rapid exploration of design spaces without immediate hardware prototyping.

major comments (1)

[Evaluation sections (HBS and chiplet analysis)] The central claims regarding bandwidth/latency requirements for HBS (large models) and optimal chiplet utilization (small models) rest entirely on the hierarchical roofline model's predictions of memory access patterns, effective bandwidth, and hierarchy effects for KV-cache serving. The manuscript provides no empirical validation, cycle-accurate simulation results, or comparison to real hardware behavior for irregular, latency-sensitive operations such as attention under concurrency. This is load-bearing for the outlined requirements and proposals, as roofline models are known to be optimistic for such workloads without accounting for contention or non-ideal access patterns.

minor comments (1)

[Abstract] The abstract and introduction would benefit from explicitly stating the key assumptions of the hierarchical roofline model (e.g., access pattern simplifications for KV-cache) to allow readers to assess applicability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for acknowledging the potential impact of our analytical framework for evaluating memory technologies in on-device generative AI inference. We address the major comment regarding the lack of empirical validation in detail below.

Referee: [Evaluation sections (HBS and chiplet analysis)] The central claims regarding bandwidth/latency requirements for HBS (large models) and optimal chiplet utilization (small models) rest entirely on the hierarchical roofline model's predictions of memory access patterns, effective bandwidth, and hierarchy effects for KV-cache serving. The manuscript provides no empirical validation, cycle-accurate simulation results, or comparison to real hardware behavior for irregular, latency-sensitive operations such as attention under concurrency. This is load-bearing for the outlined requirements and proposals, as roofline models are known to be optimistic for such workloads without accounting for contention or non-ideal access patterns.

Authors: We concur that the hierarchical roofline model, while useful for bounding performance and exploring design spaces, does not capture all aspects of real hardware behavior, particularly for concurrent, irregular memory accesses in attention layers during KV-cache operations. The manuscript is positioned as an analytical study to derive technology requirements for emerging solutions like HBS and memory chiplets, where hardware prototypes may not yet exist. Nevertheless, the referee's point is well-taken, and we have made revisions to the manuscript by adding a discussion on the limitations of the roofline approach in the context of these workloads. Specifically, we now explicitly note the potential optimism due to unmodeled contention and non-ideal access patterns, and we qualify our derived requirements as analytical estimates rather than definitive hardware specifications. We believe this enhances the paper's rigor without altering its core contributions.

Revision: yes

Circularity Check

0 steps flagged

No circularity detected; analytical model applied without self-referential reduction

The paper applies a hierarchical roofline-based analytical performance model to derive bandwidth/latency requirements for 13B models with HBS and utilization proposals for 1B models with bonded global buffers. No equations appear in the abstract or provided text, no self-citations to the authors' prior derivations are invoked as load-bearing, and no fitted parameters are redefined as predictions. The model functions as an external analytical framework whose outputs (requirements and proposals) do not feed back into its own definition or inputs, making the derivation chain self-contained rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access provides no explicit free parameters, axioms, or invented entities; the roofline model is invoked without stated assumptions or calibration details.

pith-pipeline@v0.9.0 · 5485 in / 1089 out tokens · 57425 ms · 2026-05-10T16:31:57.528845+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1] Y. C. et al., “A survey on evaluation of large language models,” ACM Trans. Intell. Syst. Technol., vol. 15, no. 3, Mar. 2024.

[2] A. d. et al., “Intelligent personal assistants: A systematic literature review,” Expert Systems with Applications, vol. 147, p. 113193, 2020.

[3] X. Wei, J. Zhang, H. Li, J. Chen, H. Guan, R. Qu, M. Li, X. Chen, and G. Luo, “Agent.xpu: Efficient scheduling of agentic llm workloads on heterogeneous soc,” 2026, arXiv:2506.24045.

[4] A. G. et al., “The llama 3 herd of models,” 2024, arXiv:2407.21783.

[5] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro et al., “Efficient large-scale language model training on gpu clusters using megatron-lm,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15.

[6] M. Davies, N. Crago, K. Sankaralingam, and C. Kozyrakis, “Liminal: Exploring the frontiers of llm decode performance,” 2025, arXiv:2507.14397.

[7] Z. Ning, J. Zhao, Q. Jin, W. Ding, and M. Guo, “Inf-mllm: Efficient streaming inference of multimodal large language models on a single gpu,” 2024, arXiv:2409.09086.

[8] A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer, “Ai and memory wall,” IEEE Micro, vol. 44, no. 3, pp. 33–39, 2024.

[9] K. Alizadeh, I. Mirzadeh, D. Belenko, K. Khatamifard, M. Cho, C. C. D. Mundo, M. Rastegari, and M. Farajtabar, “Llm in a flash: Efficient large language model inference with limited memory,” 2024, arXiv:2312.11514.

[10] Y. S. et al., “Flexgen: high-throughput generative inference of large language models with a single gpu,” in Proceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023.

[11] Tom's Hardware. (2025) Kioxia xl-flash. [Online]. Available: https://tinyurl.com/kioxia-xl-flash

[12] Y. Luo and S. Yu, “H3d-transformer: A heterogeneous 3d (h3d) computing platform for transformer model acceleration on edge devices,” vol. 29, no. 3, p. 19, Apr. 2024.

[13] J. Kundu, W. Guo, A. BanaGozar, U. De Alwis, S. Sengupta, P. Gupta, and A. Mallik, “Performance modeling and workload analysis of distributed large language model training and inference,” in 2024 IEEE International Symposium on Workload Characterization (IISWC), 2024, pp. 57–67.

[14] W. Guo, J. Kundu, U. Tos, W. Kong, G. Sisto, T. Evenblij, and M. Perumkunnil, “Keeping up with large language models: A holistic methodology of compute, memory, communication, and cost modeling,” in 2025 IEEE International Symposium on Workload Characterization (IISWC), 2025, pp. 116–126.

[15] M. Isaev, N. Mcdonald, L. Dennison, and R. Vuduc, “Calculon: a methodology and tool for high-level co-design of systems and large language models,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’23, New York, NY, USA, 2023, p. 14.

[16] N. Ardalani, S. Pal, and P. Gupta, “Deepflow: A cross-stack pathfinding framework for distributed ai systems,” ACM Transactions on Design Automation of Electronic Systems, vol. 29, no. 2, pp. 1–20, 2024.

[17] M. Babaie, A. Akram, W. Elsasser, B. Haukness, M. R. Miller, T. Song, T. Vogelsang, S. C. Woo, and J. Lowe-Power, “Efficient caching with a tag-enhanced dram,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2025, pp. 745–760.
