Pith · machine review for the scientific record

arxiv: 2601.13506 · v3 · submitted 2026-01-20 · 💻 cs.IT · math.IT

Recognition: unknown

Group Relative Policy Optimization for Robust Blind Interference Alignment with Fluid Antennas

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 13:06 UTC · model grok-4.3

classification 💻 cs.IT math.IT
keywords fluid antenna system · blind interference alignment · group relative policy optimization · imperfect CSI · deep reinforcement learning · sum-rate maximization · MISO downlink · antenna position optimization

The pith

Group relative policy optimization solves robust sum-rate maximization for fluid antenna positions in blind interference alignment under imperfect CSI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a fluid antenna system that uses dynamic switching to enable blind interference alignment in a K-user MISO downlink. It formulates the problem of choosing antenna positions to maximize sum rate when channel state information is imperfect and solves it with group relative policy optimization, a reinforcement learning method that drops the critic network. GRPO uses group-based exploration to learn error distributions in the channel estimates, which cuts model size and computation nearly in half compared with proximal policy optimization while raising achieved rates.
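As a rough illustration of the mechanism (not the paper's implementation), the critic-free update can be sketched as: sample a group of candidate antenna-position vectors from the policy, score each by its achieved sum rate, and standardize the rewards within the group to obtain advantages. The `sum_rate` reward, the channel shapes, and the group size below are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def sum_rate(positions, h_hat):
    """Toy stand-in reward: sum rate for a vector of per-user antenna
    switching positions under estimated channels h_hat (hypothetical
    placeholder for the paper's BIA rate expression)."""
    gains = np.abs(h_hat[np.arange(len(positions)), positions]) ** 2
    return float(np.sum(np.log2(1.0 + gains)))

def grpo_advantages(rewards):
    """Critic-free, group-relative advantage: standardize each reward
    against its own group's mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One update step: sample a group of G candidate position vectors,
# score them, and weight their log-probabilities by the group-relative
# advantage (the policy network itself is omitted here).
K, N, G = 3, 4, 8   # users, candidate positions per user, group size
h_hat = rng.normal(size=(K, N)) + 1j * rng.normal(size=(K, N))
group = [rng.integers(0, N, size=K) for _ in range(G)]
rewards = [sum_rate(a, h_hat) for a in group]
adv = grpo_advantages(rewards)
# adv sums to ~0: above-group-average samples are reinforced, the rest suppressed.
```

The point of the sketch is that no learned value function appears anywhere: the group itself serves as the baseline, which is where the claimed halving of model size and FLOPs would come from.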

Core claim

This paper shows that group relative policy optimization (GRPO) can be applied to the non-convex problem of selecting fluid antenna switching positions to maximize sum rate in a K-user MISO downlink that uses blind interference alignment, even when channel state information is imperfect. The group-based exploration mechanism learns the distribution of estimation errors, removes the need for a critic network, and produces both lower complexity and higher sum rates than standard proximal policy optimization or simple heuristics.

What carries the argument

Group relative policy optimization (GRPO), a deep reinforcement learning algorithm that removes the critic network and performs policy updates via group-based exploration to learn channel error distributions.

If this is right

  • GRPO reduces both model parameters and floating-point operations by nearly half relative to PPO, making real-time antenna-position control feasible on resource-limited devices.
  • The same group-exploration approach can be reused for other non-convex wireless resource-allocation tasks that involve learning error statistics.
  • Because GRPO escapes bad local optima more reliably than standard PPO, it produces larger gains over heuristic baselines when CSI error variance is high.
  • The framework extends blind interference alignment from fixed antennas to reconfigurable fluid antennas without requiring perfect CSI at the transmitter.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The reduced model size may allow GRPO to run on edge devices that jointly optimize antenna positions and user scheduling in real time.
  • If the learned error distribution transfers across environments, the same policy could be fine-tuned with far fewer samples than training from scratch.
  • Combining GRPO with other reconfigurable surfaces such as RIS might create a larger joint optimization space for interference management.
  • The performance gap over heuristics suggests that learning-based methods become essential once antennas can be reconfigured faster than the channel's coherence time.

Load-bearing premise

The gains depend on the assumption that group-based exploration can learn the true distribution of channel estimation errors and that the simulation channel models match real-world imperfect CSI conditions.
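To make this premise concrete, here is a minimal sketch of how an additive CSI error misleads position selection and costs rate. It assumes a Gaussian error model ĥ = h + e with e ~ CN(0, σₑ²), which is a common convention but not confirmed as the paper's exact model; all sizes and variances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def rate_at_best_position(h):
    """Rate when the best of N candidate positions is chosen
    for a single user's channel vector h (toy model)."""
    return float(np.log2(1.0 + np.max(np.abs(h) ** 2)))

# True channel across N candidate fluid-antenna positions.
N = 16
h = (rng.normal(size=N) + 1j * rng.normal(size=N)) / np.sqrt(2)

# Imperfect CSI: h_hat = h + e. The Gaussian error is an illustrative
# assumption; the paper's error distribution may differ.
sigma_e = 0.5
e = sigma_e * (rng.normal(size=N) + 1j * rng.normal(size=N)) / np.sqrt(2)
h_hat = h + e

# The position is chosen from the estimate, but the rate is realized
# on the true channel, so estimation error can only hurt here.
pos = int(np.argmax(np.abs(h_hat) ** 2))
realized = float(np.log2(1.0 + np.abs(h[pos]) ** 2))
ideal = rate_at_best_position(h)
loss = ideal - realized   # non-negative rate penalty from imperfect CSI
```

If GRPO's group exploration really learns the statistics of `e`, it can hedge position choices against this penalty; if the simulated `e` does not match reality, the learned hedge need not transfer.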

What would settle it

Measure actual sum-rate performance in a hardware testbed using real fluid antennas and measured CSI errors, then compare GRPO against PPO and the MaximumGain heuristic on the same hardware.

Figures

Figures reproduced from arXiv: 2601.13506 by Hao Xu, Jianqiu Peng, Mingjie Shao, Rui Wang, Shuai Wang, Tong Zhang.

Figure 1
Figure 1. Illustration of the considered K-user MISO downlink BIA system model; the users' fluid antennas switch among positions (e.g., "Position 1" for user 1 and "Position 2" for user 2, which may correspond to different physical antenna locations), and the resulting channel realizations determine the received signals in (11a)–(11b).
Figure 2
Figure 2. Illustrating the DRL algorithms (PPO-Init, PPO, …).
Figure 4
Figure 4. Proposed GRPO solution.
Figure 5
Figure 5. Performance comparison.

TABLE II: GRPO gains over baselines (%)

  Baseline      η = 10    η = 1     η = 0.1   η = 0.01
  PPO           6.95      7.40      1.74      0.59
  PPO-Init      59.42     39.86     9.97      11.89
  MaximumGain   447.77    184.47    95.91     74.97
  RandomGain    1083.41   458.52    205.28    114.30

…reduction of CSI error, qualified by η. This shows that the system is highly sensitive to CSI error, and reducing this error is significantly beneficial. For the severe CSI…
Original abstract

Fluid antenna system (FAS) leverages dynamic reconfigurability to unlock spatial degrees of freedom and reshape wireless channels. Blind interference alignment (BIA) aligns interference through antenna switching. This paper proposes, for the first time, a robust fluid antenna-driven BIA framework for a K-user MISO downlink under imperfect channel state information (CSI). We formulate a robust sum-rate maximization problem through optimizing fluid antenna positions (switching positions). To solve this challenging non-convex problem, we employ group relative policy optimization (GRPO), a novel deep reinforcement learning algorithm that eliminates the critic network. This robust design reduces model size and floating point operations (FLOPs) by nearly half compared to proximal policy optimization (PPO) while significantly enhancing performance through group-based exploration that escapes bad local optima. Simulation results demonstrate that GRPO outperforms PPO by 4.17%, and a 100K-step pre-trained PPO by 30.29%. Due to error distribution learning, GRPO exceeds heuristic MaximumGain and RandomGain by 200.78% and 465.38%, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a robust blind interference alignment (BIA) framework for fluid antenna systems (FAS) in a K-user MISO downlink under imperfect CSI. It formulates a non-convex sum-rate maximization problem over fluid antenna positions and solves it with Group Relative Policy Optimization (GRPO), a novel critic-free deep RL algorithm that uses group-based exploration. The paper claims GRPO halves model size and FLOPs relative to PPO while delivering simulation gains of 4.17% over PPO, 30.29% over 100K-step pre-trained PPO, 200.78% over MaximumGain, and 465.38% over RandomGain, attributing the improvements to learned CSI error distributions.

Significance. If the simulation results prove reproducible and the synthetic CSI error model is representative of physical FAS dynamics, GRPO could provide a lighter-weight RL alternative for non-convex wireless optimization problems. The reported complexity reduction and performance margins would be relevant for practical robust BIA deployments, especially if the group-relative mechanism generalizes beyond the chosen simulator.

major comments (3)
  1. [Simulation Results] Simulation Results section: the headline performance deltas (4.17% over PPO, 200.78% over MaximumGain) are presented without any description of the CSI error distribution, number of Monte Carlo trials, error bars, or channel model parameters. This directly undermines the central claim that gains arise from GRPO's error-distribution learning rather than simulator artifacts.
  2. [Proposed GRPO Method] GRPO algorithm description: the manuscript states that GRPO eliminates the critic network via group-relative exploration, yet supplies neither the explicit advantage estimator, the group-size update rule, nor convergence analysis. Without these, it is impossible to verify that the reported stability and local-optima escape are properties of the algorithm rather than tuning choices.
  3. [System Model] System Model and Simulation Setup: the weakest assumption—that the injected CSI error distribution is both physically representative and learnable by the group mechanism—is never tested. No ablation on the free parameter 'group size', no comparison of learned versus ground-truth error statistics, and no sensitivity analysis to error variance are provided, making the 200+% heuristic gains load-bearing on an unverified modeling choice.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'for the first time' should be accompanied by a brief citation to prior FAS-BIA literature to avoid overstatement.
  2. [Method] Notation: the definition of the group-relative advantage is introduced without an equation number or explicit formula, complicating traceability to the PPO baseline.
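On the traceability point: in the GRPO formulation popularized by [16], the group-relative advantage is simply the within-group standardized reward, plugged into a PPO-style clipped objective with no learned value baseline. If the paper follows that convention, the estimator would read (this is the standard form from the GRPO literature, not confirmed from the manuscript itself):

```latex
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\{r_1,\dots,r_G\}}
                     {\operatorname{std}\{r_1,\dots,r_G\}},
\qquad i = 1,\dots,G,
```

where G is the group size and r_i is the sum-rate reward of the i-th sampled action in the group.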

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Simulation Results] Simulation Results section: the headline performance deltas (4.17% over PPO, 200.78% over MaximumGain) are presented without any description of the CSI error distribution, number of Monte Carlo trials, error bars, or channel model parameters. This directly undermines the central claim that gains arise from GRPO's error-distribution learning rather than simulator artifacts.

    Authors: We agree with the referee that more details on the simulation setup are essential for reproducibility and to substantiate our claims. In the revised version, we will expand the Simulation Results section to include a full description of the CSI error distribution (e.g., the specific Gaussian model parameters used), the number of Monte Carlo trials (1000 independent runs), error bars indicating standard deviation across trials, and all relevant channel model parameters such as path loss exponents and noise variance. This will help demonstrate that the performance gains stem from GRPO's ability to learn the error distribution. revision: yes

  2. Referee: [Proposed GRPO Method] GRPO algorithm description: the manuscript states that GRPO eliminates the critic network via group-relative exploration, yet supplies neither the explicit advantage estimator, the group-size update rule, nor convergence analysis. Without these, it is impossible to verify that the reported stability and local-optima escape are properties of the algorithm rather than tuning choices.

    Authors: We acknowledge that the algorithmic details of GRPO require more explicit presentation to allow independent verification. We will revise the Proposed GRPO Method section to include the mathematical expression for the group-relative advantage estimator, the rule for updating the group size during training, and a high-level convergence discussion based on the relative policy updates. These additions will clarify that the observed stability and escape from local optima are inherent to the group-based mechanism. revision: yes

  3. Referee: [System Model] System Model and Simulation Setup: the weakest assumption—that the injected CSI error distribution is both physically representative and learnable by the group mechanism—is never tested. No ablation on the free parameter 'group size', no comparison of learned versus ground-truth error statistics, and no sensitivity analysis to error variance are provided, making the 200+% heuristic gains load-bearing on an unverified modeling choice.

    Authors: We recognize the importance of validating the CSI error model assumptions. In the revision, we will incorporate an ablation study varying the group size parameter and reporting its impact on performance, a comparison of the error statistics learned by GRPO against the ground-truth distribution used in simulations, and a sensitivity analysis showing how performance varies with different error variances. These experiments will strengthen the justification for the modeling choice and the reported gains over heuristics. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims rest on independent simulation comparisons

Full rationale

The paper introduces GRPO as a novel algorithm and reports its performance via direct simulation comparisons against PPO, pre-trained PPO, MaximumGain, and RandomGain baselines. No derivation chain reduces a claimed result to a fitted parameter or self-citation by construction; the sum-rate maximization is solved numerically, and the reported percentage gains are empirical outputs rather than algebraic identities. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central claim rests on the effectiveness of the new GRPO algorithm and the simulation models for imperfect CSI in FAS-BIA systems.

free parameters (1)
  • group size in GRPO
    Likely a hyperparameter chosen for the algorithm to enable group-based exploration.
invented entities (1)
  • Group Relative Policy Optimization (GRPO) · no independent evidence
    purpose: Novel DRL method without critic for antenna position optimization
    Presented as new but evidence is simulation results only.

pith-pipeline@v0.9.0 · 5495 in / 1269 out tokens · 40916 ms · 2026-05-16T13:06:32.396102+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

  1. [1]

    Fluid antenna systems,

    K.-K. Wong, A. Shojaeifard, K.-F. Tong, and Y. Zhang, “Fluid antenna systems,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1950–1962, 2021

  2. [2]

    An efficient sum-rate maximization algorithm for fluid antenna-assisted ISAC system,

    Q. Zhang, M. Shao, T. Zhang, G. Chen, J. Liu, and P. C. Ching, “An efficient sum-rate maximization algorithm for fluid antenna-assisted ISAC system,” IEEE Commun. Lett., vol. 29, no. 1, pp. 200–204, 2025

  3. [3]

    Optimizing fluid antenna configurations for constructive interference precoding,

    W. Sun, M. Shao, L. Zhu, Y. Ge, T. Zhang, and Z. Liu, “Optimizing fluid antenna configurations for constructive interference precoding,” in 2025 IEEE/CIC International Conference on Communications in China (ICCC), pp. 1–6, 2025

  4. [4]

    Indoor fluid antenna systems enabled by layout-specific modeling and group relative policy optimization,

    T. Zhang, Q. Li, S. Wang, W. Ni, J. Zhang, R. Wang, K.-K. Wong, and C.-B. Chae, “Indoor fluid antenna systems enabled by layout-specific modeling and group relative policy optimization,” 2025

  5. [5]

    Capacity maximization for FAS-assisted multiple access channels,

    H. Xu, K.-K. Wong, W. K. New, F. R. Ghadi, G. Zhou, R. Murch, C.-B. Chae, Y. Zhu, and S. Jin, “Capacity maximization for FAS-assisted multiple access channels,” IEEE Trans. Commun., vol. 73, no. 7, pp. 4713–4731, 2025

  6. [6]

    Revisiting outage probability analysis for two-user fluid antenna multiple access system,

    H. Xu, K.-K. Wong, W. K. New, K.-F. Tong, Y. Zhang, and C.-B. Chae, “Revisiting outage probability analysis for two-user fluid antenna multiple access system,” IEEE Trans. Wireless Commun., vol. 23, no. 8, pp. 9534–9548, 2024

  7. [7]

    Slow fluid antenna multiple access,

    K.-K. Wong, D. Morales-Jimenez, K.-F. Tong, and C.-B. Chae, “Slow fluid antenna multiple access,” IEEE Trans. Commun., vol. 71, no. 5, pp. 2831–2846, 2023

  8. [8]

    cGAN-based slow fluid antenna multiple access,

    M. Eskandari, A. G. Burr, K. Cumanan, and K.-K. Wong, “cGAN-based slow fluid antenna multiple access,” IEEE Wireless Commun. Lett., vol. 13, no. 10, pp. 2907–2911, 2024

  9. [9]

    Deep learning enabled slow fluid antenna multiple access,

    N. Waqar, K.-K. Wong, K.-F. Tong, A. Sharples, and Y. Zhang, “Deep learning enabled slow fluid antenna multiple access,” IEEE Commun. Lett., vol. 27, no. 3, pp. 861–865, 2023

  10. [10]

    Turbocharging fluid antenna multiple access,

    N. Waqar, K.-K. Wong, C.-B. Chae, and R. Murch, “Turbocharging fluid antenna multiple access,” IEEE Trans. Wireless Commun., pp. 1–1, 2025

  11. [11]

    Blind interference alignment,

    S. A. Jafar, “Blind interference alignment,” IEEE J. Sel. Top. Signal Process., vol. 6, no. 3, pp. 216–227, 2012

  12. [12]

    Aiming perfectly in the dark - blind interference alignment through staggered antenna switching,

    T. Gou, C. Wang, and S. A. Jafar, “Aiming perfectly in the dark - blind interference alignment through staggered antenna switching,” IEEE Trans. Signal Process., vol. 59, no. 6, pp. 2734–2744, 2011

  13. [13]

    Degrees of freedom characterization: The 3-user SISO interference channel with blind interference alignment,

    C. Wang, “Degrees of freedom characterization: The 3-user SISO interference channel with blind interference alignment,” IEEE Commun. Lett., vol. 18, no. 5, pp. 757–760, 2014

  14. [14]

    Blind interference alignment for cellular networks,

    M. Morales-Céspedes, J. Plata-Chaves, D. Toumpakaris, S. A. Jafar, and A. G. Armada, “Blind interference alignment for cellular networks,” IEEE Trans. Signal Process., vol. 63, no. 1, pp. 41–56, 2015

  15. [15]

    BIA for the K-user interference channel using reconfigurable antenna at receivers,

    M. Johnny and M. R. Aref, “BIA for the K-user interference channel using reconfigurable antenna at receivers,” IEEE Trans. Inf. Theory, vol. 66, no. 4, pp. 2184–2197, 2020

  16. [16]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al., “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,” Nature, vol. 645, no. 8081, pp. 633–638, 2025

  17. [17]

    Proximal policy optimization algorithms,

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

  18. [18]

    Movable antenna aided NOMA: Joint antenna positioning, precoding, and decoding design,

    Z. Xiao, Z. Li, L. Zhu, B. Ning, D. B. D. Costa, X.-G. Xia, and R. Zhang, “Movable antenna aided NOMA: Joint antenna positioning, precoding, and decoding design,” IEEE Trans. Wireless Commun., pp. 1–1, 2025

  19. [19]

    A framework of robust transmission design for IRS-aided MISO communications with imperfect cascaded channels,

    G. Zhou, C. Pan, H. Ren, K. Wang, and A. Nallanathan, “A framework of robust transmission design for IRS-aided MISO communications with imperfect cascaded channels,” IEEE Trans. Signal Process., vol. 68, pp. 5092–5106, 2020

  20. [20]

    An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,

    Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,” IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4331–4340, 2011