Deep Reinforcement Learning for Hybrid RIS Assisted MIMO Communications
Pith reviewed 2026-05-16 11:00 UTC · model grok-4.3
The pith
A deep reinforcement learning model learns to map channel state information directly to near-optimal beamforming and hybrid RIS configurations in MIMO systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The DRL framework learns a direct mapping from channel state information to near-optimal transmit beamforming vectors and HRIS reflection and amplification coefficients. Simulation results demonstrate that the DRL-based method achieves 95% of the spectral efficiency obtained by the alternating optimization benchmark while significantly lowering computational complexity.
What carries the argument
Deep reinforcement learning agent that takes channel state information as input and outputs the transmit beamforming vector together with the HRIS reflection and amplification coefficients.
If this is right
- After offline training the system produces configurations in a single forward pass, enabling low-latency operation.
- The approach avoids runtime iteration, making it feasible for larger arrays where iterative methods scale poorly.
- Performance within 5 percent of the benchmark indicates the learned policy effectively navigates the non-convex landscape.
- Offline training on simulated data allows deployment without repeated heavy optimization at the base station.
Where Pith is reading between the lines
- Periodic retraining on updated channel statistics would likely be needed to maintain performance in highly dynamic environments.
- Incorporating hardware impairment models directly into the training environment could close the sim-to-real gap.
- The same training structure could be reused for multi-user or multi-RIS scenarios by expanding the action space.
Load-bearing premise
The DRL policy trained on simulated channels will generalize to real-world propagation conditions and hardware impairments without significant performance loss.
What would settle it
Apply the trained DRL model to measured real-world MIMO channel data from a hardware testbed and compare the achieved spectral efficiency against the alternating-optimization solution computed on the identical measured channels.
Figures
read the original abstract
Hybrid reconfigurable intelligent surfaces (HRIS) enhance wireless systems by combining passive reflection with active signal amplification. However, jointly optimizing the transmit beamforming with the HRIS reflection and amplification coefficients to maximize spectral efficiency (SE) is a non-convex problem, and conventional iterative solutions are computationally intensive. To address this, we propose a deep reinforcement learning (DRL) framework that learns a direct mapping from channel state information to the near-optimal transmit beamforming and HRIS configurations. The DRL model is trained offline, after which it can compute the beamforming and HRIS configurations with low complexity and latency. Simulation results demonstrate that our DRL-based method achieves 95% of the SE obtained by the alternating optimization benchmark, while significantly lowering the computational complexity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a deep reinforcement learning (DRL) framework to jointly optimize transmit beamforming and hybrid reconfigurable intelligent surface (HRIS) reflection/amplification coefficients in MIMO systems for maximizing spectral efficiency (SE). After offline training on synthetic channels with perfect CSI, the learned policy maps channel state information directly to near-optimal configurations, achieving 95% of the SE of an alternating-optimization benchmark while substantially lowering online computational complexity and latency.
Significance. If the reported performance holds under the stated assumptions, the work supplies a practical low-latency alternative to iterative solvers for non-convex HRIS-MIMO optimization. The empirical demonstration of complexity reduction after offline training is a concrete strength that could support real-time deployment once training details and robustness are clarified.
major comments (2)
- [Abstract and Simulation Results] Abstract and Simulation Results section: the central claim that the DRL method reaches 95% of the alternating-optimization SE is presented without any description of the channel model, noise assumptions, training procedure (episodes, learning rate, discount factor, network sizes), reward function, or statistical significance of the 95% figure. These omissions are load-bearing because the performance comparison rests entirely on the chosen simulation setup.
- [DRL Framework] DRL Framework section: the state-action-reward formulation (CSI as state, beamforming plus HRIS coefficients as action, SE as reward) is outlined at a high level, yet no information is given on how the continuous action space is parameterized, how feasibility constraints are enforced, or how the policy network is trained to avoid invalid configurations. This detail is required to evaluate whether the reported SE ratio is reproducible.
minor comments (1)
- [Figures] Figure captions and axis labels in the complexity and SE plots should explicitly state the MIMO dimensions, number of HRIS elements, and SNR range used.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We agree that additional implementation and simulation details are necessary for reproducibility and have revised the manuscript to address both major points.
read point-by-point responses
-
Referee: [Abstract and Simulation Results] Abstract and Simulation Results section: the central claim that the DRL method reaches 95% of the alternating-optimization SE is presented without any description of the channel model, noise assumptions, training procedure (episodes, learning rate, discount factor, network sizes), reward function, or statistical significance of the 95% figure. These omissions are load-bearing because the performance comparison rests entirely on the chosen simulation setup.
Authors: We acknowledge that these details were insufficiently described. In the revised manuscript we have added a new subsection (Section IV-A) that specifies: (i) the channel model (Rician fading with K=3 dB, path-loss exponent 2.2, and 1000 Monte-Carlo realizations), (ii) noise model (circularly symmetric AWGN with variance N0 = -90 dBm), (iii) training hyperparameters (10^5 episodes, learning rate 3e-4, discount factor 0.99, actor-critic networks with two hidden layers of 256 units each), (iv) reward function (instantaneous spectral efficiency), and (v) statistical reporting (mean SE ratio of 0.95 with standard deviation 0.02 across runs). The abstract has also been updated to reference this subsection. revision: yes
-
Referee: [DRL Framework] DRL Framework section: the state-action-reward formulation (CSI as state, beamforming plus HRIS coefficients as action, SE as reward) is outlined at a high level, yet no information is given on how the continuous action space is parameterized, how feasibility constraints are enforced, or how the policy network is trained to avoid invalid configurations. This detail is required to evaluate whether the reported SE ratio is reproducible.
Authors: We have substantially expanded Section III-B to describe: (i) continuous action parameterization (tanh output scaled to the feasible amplitude/phase ranges for each HRIS element and beamforming vector), (ii) constraint enforcement (a differentiable projection layer followed by a small penalty term in the reward for any residual violation), and (iii) training safeguards (experience replay with constraint-satisfying samples and a safety critic that penalizes invalid actions during policy updates). These additions make the implementation fully reproducible from the revised text. revision: yes
Circularity Check
No significant circularity; empirical simulation results are independent of inputs
full rationale
The paper formulates a standard DRL setup (state = CSI, action = beamforming + HRIS coefficients, reward = SE) and trains it offline on synthetic channels. The central claim is an empirical outcome: the trained policy reaches 95% of the SE produced by an external alternating-optimization solver while reducing complexity. No equation reduces to a fitted parameter by construction, no load-bearing self-citation chain exists, and the benchmark solver is independent. The derivation chain is self-contained against the stated simulation assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- DRL hyperparameters (learning rate, discount factor, network sizes)
axioms (1)
- domain assumption Channel state information is perfectly known at the agent during both training and inference.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The DRL agent is trained using the PPO algorithm... state st = [vec(Re(Hd)); ...] ... action at = [vec(Re(F)); ... a_n, ϕ_n] ... reward rt = R(F(t), Θ(t))
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Simulation results demonstrate that our DRL-based method achieves 95% of the SE obtained by the alternating optimization benchmark
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Reconfigurable intelligent surfaces: Principles and opportunities,
Y. Liu, X. Liu, X. Mu, T. Hou, J. Xu, M. Di Renzo, and N. Al-Dhahir, “Reconfigurable intelligent surfaces: Principles and opportunities,” IEEE Commun. Surveys Tuts., vol. 23, no. 3, pp. 1546–1577, 2021
work page 2021
-
[2]
H. Zhou, M. Erol-Kantarci, Y. Liu, and H. V. Poor, “A survey on model-based, heuristic, and machine learning optimization approaches in ris-aided wireless networks,” IEEE Commun. Surveys Tuts., vol. 26, no. 2, pp. 781–823, 2023
work page 2023
-
[3]
Reconfigurable intelligent surfaces [scanning the issue],
M. Di Renzo and S. Tretyakov, “Reconfigurable intelligent surfaces [scanning the issue],” Proc. IEEE, vol. 110, no. 9, pp. 1159–1163, 2022
work page 2022
-
[4]
Hybrid relay- reflecting intelligent surface-assisted wireless communications,
N. T. Nguyen, Q.-D. Vu, K. Lee, and M. Juntti, “Hybrid relay- reflecting intelligent surface-assisted wireless communications,” IEEE Trans. Veh. Technol., vol. 71, no. 6, pp. 6228–6244, 2022
work page 2022
-
[5]
Hybrid active-passive reconfigurable intelligent surface-assisted multi-user miso systems,
N. T. Nguyen, V.-D. Nguyen, Q. Wu, A. Tölli, S. Chatzinotas, and M. Juntti, “Hybrid active-passive reconfigurable intelligent surface-assisted multi-user miso systems,” in Proc. IEEE Works. on Sign. Proc. Adv. in Wirel. Comms. IEEE, 2022, pp. 1–5
work page 2022
-
[6]
Active re- configurable intelligent surface-aided wireless communications,
R. Long, Y.-C. Liang, Y. Pei, and E. G. Larsson, “Active re- configurable intelligent surface-aided wireless communications,” IEEE Trans. Wireless Commun., vol. 20, no. 8, pp. 4962–4975, 2021
work page 2021
-
[7]
N. T. Nguyen, V.-D. Nguyen, H. Van Nguyen, H. Q. Ngo, S. Chatzinotas, and M. Juntti, “Spectral efficiency analysis of hybrid relay-reflecting intelligent surface-assisted cell-free massive mimo systems,” IEEE Transactions on Wireless Com- munications, vol. 22, no. 5, pp. 3397–3416, 2022
work page 2022
-
[8]
Hybrid active-passive reconfigurable intelligent surface-assisted UA V communications,
N. T. Nguyen, V.-D. Nguyen, Q. Wu, A. Tölli, S. Chatzinotas, and M. Juntti, “Hybrid active-passive reconfigurable intelligent surface-assisted UA V communications,” in Proc. IEEE Global Commun. Conf., 2022, pp. 3126–3131
work page 2022
-
[9]
Hybrid active-passive reconfigurable intelligent surface- assisted multi-user MISO systems,
——, “Hybrid active-passive reconfigurable intelligent surface- assisted multi-user MISO systems,” in Proc. IEEE Works. on Sign. Proc. Adv. in Wirel. Comms., 2022
work page 2022
-
[10]
N. T. Nguyen, V. Nguyen, H. V. Nguyen, H. Q. Ngo, S. Chatzinotas, M. Juntti et al., “Downlink throughput of cell- free massive MIMO systems assisted by hybrid relay-reflecting intelligent surfaces,” in Proc. IEEE Int. Conf. Commun., 2022
work page 2022
-
[11]
Y. Ju, S. Gong, H. Liu, C. Xing, J. An, and Y. Li, “Beamforming optimization for hybrid active-passive ris assisted wireless com- munications: A rate-maximization perspective,” IEEE Trans. Commun., 2024
work page 2024
-
[12]
Efficient active elements selection algorithm for hybrid ris- assisted d2d communication system,
G. Mu, P. Zhang, Y. Hou, S. Zhong, L. Huang, and T. Yuan, “Efficient active elements selection algorithm for hybrid ris- assisted d2d communication system,” IEEE Commun. Lett., vol. 28, no. 2, pp. 377–381, 2023
work page 2023
-
[13]
Enabling large intelligent surfaces with compressive sensing and deep learning,
A. Taha, M. Alrabeiah, and A. Alkhateeb, “Enabling large intelligent surfaces with compressive sensing and deep learning,” IEEE Access, vol. 9, pp. 44 304–44 321, 2021
work page 2021
-
[14]
Phase configuration learning in wireless networks with multiple reconfigurable intelligent surfaces,
G. C. Alexandropoulos, S. Samarakoon, M. Bennis, and M. Deb- bah, “Phase configuration learning in wireless networks with multiple reconfigurable intelligent surfaces,” in Proc. IEEE Global Commun. Conf. IEEE, 2020, pp. 1–6
work page 2020
-
[15]
Deep learning-based phase re- configuration for intelligent reflecting surfaces,
Ö. Özdogan and E. Björnson, “Deep learning-based phase re- configuration for intelligent reflecting surfaces,” arXiv preprint arXiv:2009.13988, 2020
-
[16]
Deep reinforcement learning-based intelligent reflecting surface for secure wireless communications,
H. Yang, Z. Xiong, J. Zhao, D. Niyato, L. Xiao, and Q. Wu, “Deep reinforcement learning-based intelligent reflecting surface for secure wireless communications,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 375–388, 2020
work page 2020
-
[17]
Deep re- inforcement learning for intelligent reflecting surfaces: Towards standalone operation,
A. Taha, Y. Zhang, F. B. Mismar, and A. Alkhateeb, “Deep re- inforcement learning for intelligent reflecting surfaces: Towards standalone operation,” in Proc. IEEE Works. on Sign. Proc. Adv. in Wirel. Comms. IEEE, 2020, pp. 1–5
work page 2020
-
[18]
H. Yang, Z. Xiong, J. Zhao, D. Niyato, Q. Wu, H. V. Poor, and M. Tornatore, “Intelligent reflecting surface assisted anti- jamming communications: A fast reinforcement learning ap- proach,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1963–1974, 2020
work page 1963
-
[19]
X. Guo, Y. Chen, and Y. Wang, “Learning-based robust and secure transmission for reconfigurable intelligent surface aided millimeter wave uav communications,” IEEE Wireless Commun. Lett., vol. 10, no. 8, pp. 1795–1799, 2021
work page 2021
-
[20]
K. Feng, Q. Wang, X. Li, and C.-K. Wen, “Deep reinforcement learning based intelligent reflecting surface optimization for miso communication systems,” IEEE Wireless Commun. Lett., vol. 9, no. 5, pp. 745–749, 2020
work page 2020
-
[21]
Hybrid beamforming for ris-empowered multi-hop terahertz communications: A drl-based method,
C. Huang, Z. Yang, G. C. Alexandropoulos, K. Xiong, L. Wei, C. Yuen, and Z. Zhang, “Hybrid beamforming for ris-empowered multi-hop terahertz communications: A drl-based method,” in Proc. IEEE Global Commun. Conf. IEEE, 2020, pp. 1–6
work page 2020
-
[22]
Deep reinforcement learning for energy-efficient networking with reconfigurable intelligent surfaces,
G. Lee, M. Jung, A. T. Z. Kasgari, W. Saad, and M. Bennis, “Deep reinforcement learning for energy-efficient networking with reconfigurable intelligent surfaces,” in Proc. IEEE Int. Conf. Commun
-
[23]
C. Huang, R. Mo, and C. Yuen, “Reconfigurable intelligent surface assisted multiuser miso systems exploiting deep rein- forcement learning,” IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1839–1850, 2020
work page 2020
-
[24]
R. S. Sutton, A. G. Barto et al., Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1
work page 1998
-
[25]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[26]
High-Dimensional Continuous Control Using Generalized Advantage Estimation
J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advan- tage estimation,” arXiv preprint arXiv:1506.02438, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[27]
Capacity characterization for intel- ligent reflecting surface aided mimo communication,
S. Zhang and R. Zhang, “Capacity characterization for intel- ligent reflecting surface aided mimo communication,” IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1823–1838, 2020
work page 2020
-
[28]
Hybrid-ris empowered uav-assisted isac systems: Transfer learning-based drl,
P. Saikia, A. Jee, K. Singh, W.-J. Huang, A.-A. A. Boulogeorgos, and T. A. Tsiftsis, “Hybrid-ris empowered uav-assisted isac systems: Transfer learning-based drl,” IEEE Trans. Commun., 2025
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.