Cache-Aided NOMA Mobile Edge Computing: A Reinforcement Learning Approach
Pith reviewed 2026-05-25 19:10 UTC · model grok-4.3
The pith
A cache-aided NOMA mobile edge computing system uses LSTM-based task popularity prediction and Bayesian learning automata (BLA) based Q-learning to maximize long-term reward through joint offloading, resource allocation, and caching decisions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that the BLA based action selection scheme is instantaneously self-correcting and the selected action is an optimal solution for each state. Extensive simulation results demonstrate that the proposed framework significantly outperforms the benchmarks like all local computing, all offloading computing, and non-cache computing.
What carries the argument
A Bayesian learning automata based action selection scheme embedded in multi-agent Q-learning, which the paper claims selects the optimal offloading action in every state.
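To make the mechanism concrete, here is a minimal sketch of a generic Bayesian learning automaton for Bernoulli rewards: each action keeps a Beta posterior, one value is sampled from every posterior, and the action with the largest sample is played. This is an illustration under stated assumptions, not the paper's scheme. The function names (`bla_select`, `bla_update`, `run_bla`), the uniform Beta(1, 1) priors, and the binary reward model are all hypothetical choices, and the coupling to Q-values in MAQ-learning is not shown.

```python
import random


def bla_select(posteriors, rng):
    """Sample each action's Beta posterior once; return the index of the largest draw."""
    draws = [rng.betavariate(a, b) for a, b in posteriors]
    return max(range(len(draws)), key=draws.__getitem__)


def bla_update(posteriors, arm, rewarded):
    """Bayesian update: a success increments alpha, a failure increments beta."""
    a, b = posteriors[arm]
    posteriors[arm] = (a + 1, b) if rewarded else (a, b + 1)


def run_bla(success_probs, steps, seed=0):
    """Run the automaton against hypothetical Bernoulli actions (assumed reward model)."""
    rng = random.Random(seed)
    posteriors = [(1, 1) for _ in success_probs]  # uniform Beta(1, 1) priors (assumption)
    pulls = [0] * len(success_probs)
    for _ in range(steps):
        arm = bla_select(posteriors, rng)
        rewarded = rng.random() < success_probs[arm]
        bla_update(posteriors, arm, rewarded)
        pulls[arm] += 1
    return posteriors, pulls
```

Calling `run_bla([0.2, 0.8], 2000)` should concentrate most selections on the second action. The sketch has no tunable exploration parameter, because exploration comes from the spread of the posteriors themselves.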
If this is right
- LSTM prediction error decreases as the learning rate increases.
- The framework outperforms all-local, all-offload, and non-cache baselines.
- BLA-based multi-agent Q-learning improves on conventional reinforcement learning methods.
Where Pith is reading between the lines
- The self-correcting property of the action selector could transfer to other multi-agent reinforcement learning problems that require fast adaptation in wireless networks.
- Adding mobility or location data to the LSTM input might improve caching accuracy when users move between edge servers.
- Testing the joint optimization under sudden changes in the number of active users would reveal how well the claimed optimality scales.
Load-bearing premise
Task popularity follows temporal patterns that an LSTM can forecast accurately enough to support effective long-term reward maximization.
What would settle it
Execute the full system on real user task request traces and check whether performance remains superior to the three benchmarks or whether the BLA selector ever fails to pick the optimal action in a visited state.
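The second half of that test could be sketched as follows, assuming a reference Q-table is available for the visited states (for example from exhaustive evaluation on a small instance) together with a log of the actions the selector actually took. Both inputs, the function name `find_suboptimal_selections`, and the tolerance are hypothetical; the page does not say how the optimal action would be established on real traces.

```python
def find_suboptimal_selections(q_table, visits, tol=1e-9):
    """Return logged (state, action) pairs whose Q-value falls short of the state's best.

    q_table: dict mapping state -> dict mapping action -> reference Q-value (assumed given).
    visits:  iterable of (state, action) pairs logged from the selector.
    """
    failures = []
    for state, action in visits:
        best = max(q_table[state].values())
        if q_table[state][action] < best - tol:
            failures.append((state, action))
    return failures
```

An empty result on a long trace would be consistent with the per-state optimality claim, while a single entry would be a counterexample under the assumed reference values. Ties count as optimal here.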
Original abstract
A novel non-orthogonal multiple access (NOMA) based cache-aided mobile edge computing (MEC) framework is proposed. For the purpose of efficiently allocating communication and computation resources to users' computation tasks requests, we propose a long-short-term memory (LSTM) network to predict the task popularity. Based on the predicted task popularity, a long-term reward maximization problem is formulated that involves a joint optimization of the task offloading decisions, computation resource allocation, and caching decisions. To tackle this challenging problem, a single-agent Q-learning (SAQ-learning) algorithm is invoked to learn a long-term resource allocation strategy. Furthermore, a Bayesian learning automata (BLA) based multi-agent Q-learning (MAQ-learning) algorithm is proposed for task offloading decisions. More specifically, a BLA based action select scheme is proposed for the agents in MAQ-learning to select the optimal action in every state. We prove that the BLA based action selection scheme is instantaneously self-correcting and the selected action is an optimal solution for each state. Extensive simulation results demonstrate that: 1) The prediction error of the proposed LSTMs based task popularity prediction decreases with increasing learning rate. 2) The proposed framework significantly outperforms the benchmarks like all local computing, all offloading computing, and non-cache computing. 3) The proposed BLA based MAQ-learning achieves an improved performance compared to conventional reinforcement learning algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a NOMA-based cache-aided MEC framework that uses an LSTM network to predict task popularity and formulates a long-term reward maximization problem that jointly optimizes task offloading, computation resource allocation, and caching decisions. It solves the problem with single-agent Q-learning (SAQ-learning) for resource allocation and a Bayesian learning automata (BLA) based multi-agent Q-learning (MAQ-learning) algorithm for offloading decisions, and it claims to prove that the BLA action selection is instantaneously self-correcting and yields an optimal action per state. The simulations are reported to show that the LSTM prediction error decreases as the learning rate increases, that the framework outperforms the all-local, all-offloading, and non-cache benchmarks, and that BLA-based MAQ-learning outperforms conventional RL.
Significance. If the claimed optimality proof for the BLA scheme holds under the non-stationary popularity induced by the LSTM predictor and the joint optimization, and if the simulation results are statistically robust, the work would contribute a concrete RL-based approach to dynamic resource allocation in cache-aided NOMA MEC systems, potentially improving long-term performance over static or myopic baselines.
major comments (2)
- [Abstract / proof section] Abstract (and the section containing the BLA optimality proof): The central claim that 'the BLA based action selection scheme is instantaneously self-correcting and the selected action is an optimal solution for each state' underpins the asserted superiority of MAQ-learning over conventional RL; however, the provided text gives no derivation, no statement of assumptions (e.g., reward stationarity, perfect Q-value knowledge, or handling of LSTM-induced non-stationarity), and no explicit reduction showing how the self-correcting property survives the joint offloading-caching-NOMA optimization. This must be supplied with the precise conditions under which optimality holds.
- [Abstract / simulation section] Abstract (simulation claims): The statements that the framework 'significantly outperforms' the three benchmarks and that BLA-MAQ-learning 'achieves an improved performance' are load-bearing for the practical contribution, yet no error bars, number of independent runs, confidence intervals, or data-exclusion rules are mentioned; without these, the quantitative superiority cannot be assessed.
minor comments (1)
- [Abstract] The abstract refers to 'the proposed LSTMs based task popularity prediction' (plural) but earlier mentions 'a long-short-term memory (LSTM) network' (singular); notation should be consistent.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the presentation of the optimality claim and the statistical reporting of results. We address each major comment below and will revise the manuscript accordingly.
-
Referee: [Abstract / proof section] Abstract (and the section containing the BLA optimality proof): The central claim that 'the BLA based action selection scheme is instantaneously self-correcting and the selected action is an optimal solution for each state' underpins the asserted superiority of MAQ-learning over conventional RL; however, the provided text gives no derivation, no statement of assumptions (e.g., reward stationarity, perfect Q-value knowledge, or handling of LSTM-induced non-stationarity), and no explicit reduction showing how the self-correcting property survives the joint offloading-caching-NOMA optimization. This must be supplied with the precise conditions under which optimality holds.
Authors: We agree the current text states the claim without a full derivation or explicit assumptions. In revision we will add a dedicated appendix containing the complete proof of the instantaneously self-correcting property, together with the precise assumptions (stationary rewards within each prediction window, known Q-values at convergence, and the reduction showing the property holds for the joint offloading-caching-NOMA action space). We will also clarify that the LSTM predictor is used to generate a fixed popularity vector for each optimization epoch, thereby restoring stationarity for the subsequent RL stage.
revision: yes
-
Referee: [Abstract / simulation section] Abstract (simulation claims): The statements that the framework 'significantly outperforms' the three benchmarks and that BLA-MAQ-learning 'achieves an improved performance' are load-bearing for the practical contribution, yet no error bars, number of independent runs, confidence intervals, or data-exclusion rules are mentioned; without these, the quantitative superiority cannot be assessed.
Authors: We concur that statistical details are required for rigorous assessment. The revised manuscript will report the number of independent Monte-Carlo runs (100), include error bars as one standard deviation, and add 95% confidence intervals for all key performance curves. Data-exclusion rules (none applied) will also be stated.
revision: yes
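The statistics promised in this response can be computed along the following lines. This is a minimal sketch: the function name `summarize_runs` is hypothetical, and the normal-approximation interval with z = 1.96 is an assumption, since the rebuttal does not say how the 95% confidence intervals will be constructed.

```python
import math
import statistics


def summarize_runs(values, z=1.96):
    """Return the mean, sample standard deviation, and a normal-approximation 95% CI.

    values: one performance measurement per independent Monte-Carlo run.
    z:      critical value for the interval (1.96 assumes approximate normality).
    """
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    half_width = z * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)
```

With 100 runs per point, the one-standard-deviation error bar describes run-to-run spread, while the confidence interval describes uncertainty in the mean and is ten times narrower than 1.96 standard deviations. The two should be labelled separately on any plot.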
Circularity Check
No significant circularity; derivations remain independent of inputs
The paper's core chain—LSTM-based task popularity prediction feeding a long-term reward maximization, followed by SAQ-learning and BLA-augmented MAQ-learning with an asserted proof of instantaneous self-correction and per-state optimality—does not reduce any claimed result to its fitted inputs by construction. The LSTM component is a standard supervised predictor whose outputs are treated as exogenous inputs to the subsequent optimization; the BLA proof is presented as a standalone mathematical argument rather than a tautology or renamed fit. No self-citation chains, ansatz smuggling, or uniqueness theorems imported from prior author work appear as load-bearing elements. Simulations are reported separately as empirical validation and do not substitute for the claimed proof. The derivation is therefore self-contained against external benchmarks.