Policy-Driven DRL-Based TXOP Adaptation in NR-U and Wi-Fi Coexistence

Chiapin Wang; Po-Heng Chou; Shou-Yu Chen; Yi-Fang Yu

arXiv: 2605.00457 · v3 · pith:HV37WFL7new · submitted 2026-05-01 · 💻 cs.NI · cs.LG · cs.SY · eess.SY

Pith reviewed 2026-05-09 19:00 UTC · model grok-4.3

classification 💻 cs.NI · cs.LG · cs.SY · eess.SY

keywords NR-U Wi-Fi coexistence · deep reinforcement learning · TXOP control · fairness-throughput tradeoff · Markov decision process · policy-driven DRL · unlicensed spectrum sharing · Jain fairness index

The pith

A policy-driven deep reinforcement learning framework models NR-U and Wi-Fi coexistence as a Markov decision process and uses reward design to learn TXOP control policies that set explicit operating points on the fairness-throughput curve.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the imbalance in unlicensed spectrum when 5G NR-U and Wi-Fi use incompatible channel access rules. It casts the interaction as a Markov decision process and trains a deep Q-network to adjust transmit opportunities based on observed states and rewards. Changing the reward function creates three separate policies that prioritize absolute fairness, moderate fairness, or overall utility. Simulations show these policies reach Jain fairness above 0.9 while delivering large measured gains in throughput and utility over fixed baselines. The work illustrates how reward engineering can give operators tunable system-level control without repeated manual retuning.
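For reference, the Jain fairness index used as the headline fairness metric has a standard closed form. The worked case below assumes the index is taken over two aggregate throughput shares (one for NR-U, one for Wi-Fi); the page does not state which quantities the paper feeds into it, so treat the two-share reading as an illustration only.

```latex
\[
J(x_1,\dots,x_n) \;=\; \frac{\left(\sum_{i=1}^{n} x_i\right)^{2}}{n \sum_{i=1}^{n} x_i^{2}},
\qquad \frac{1}{n} \le J \le 1 .
\]
% Two shares in a 2:1 ratio sit exactly on the 0.9 line:
\[
J(2,1) \;=\; \frac{(2+1)^{2}}{2\,(2^{2}+1^{2})} \;=\; \frac{9}{10} \;=\; 0.9 ,
\qquad
J(1,1) \;=\; \frac{4}{4} \;=\; 1 .
\]
```

Under that two-share assumption, a Jain index above 0.9 means neither technology obtains more than twice the throughput of the other.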

Core claim

The coexistence process is formulated as a Markov decision process and a deep Q-network learns control policies through online interaction. A policy layer via reward design enables explicit control of system-level tradeoffs among fairness, throughput, and quality of service. Three policies, namely absolute fairness, moderate fairness, and utility-based fairness, are developed to achieve different operating points. Simulation results show that the proposed framework achieves a Jain fairness index above 0.9 under strict fairness control. Compared to absolute fairness, moderate fairness improves aggregate throughput by 68.22 percent, while the utility-based policy further enhances utility by 177.6 percent.

What carries the argument

The policy layer created by reward design inside DQN training, which encodes absolute fairness, moderate fairness, or utility-based fairness objectives to steer the learned TXOP adjustment policy.
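To make the idea of a reward-defined policy layer concrete, here is a minimal sketch of three reward functions, one per named objective. The functional forms, the `j_min` threshold, the `capacity` normalizer, and every function name are assumptions made for illustration; the page does not give the paper's actual reward definitions or weights.

```python
import math


def jain2(a: float, b: float) -> float:
    """Jain fairness index for two throughput shares."""
    denom = 2.0 * (a * a + b * b)
    return 1.0 if denom == 0 else (a + b) ** 2 / denom


def reward_absolute(t_nru: float, t_wifi: float) -> float:
    """Hypothetical absolute-fairness reward: penalize any throughput gap (0 is best)."""
    total = t_nru + t_wifi
    return 0.0 if total == 0 else -abs(t_nru - t_wifi) / total


def reward_moderate(t_nru: float, t_wifi: float, capacity: float,
                    j_min: float = 0.8) -> float:
    """Hypothetical moderate-fairness reward.

    Pays normalized aggregate throughput once fairness clears j_min;
    otherwise returns the (negative) fairness shortfall.
    """
    j = jain2(t_nru, t_wifi)
    if j >= j_min:
        return (t_nru + t_wifi) / capacity
    return j - j_min


def reward_utility(t_nru: float, t_wifi: float) -> float:
    """Hypothetical utility reward: sum of log utilities (proportional-fairness style)."""
    return math.log1p(t_nru) + math.log1p(t_wifi)
```

In a sketch like this, only the reward callable changes between policies, while the agent, state, and action set are shared. That is the sense in which the reward acts as a policy layer.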

If this is right

Absolute fairness policy keeps Jain index above 0.9 but limits total throughput.
Moderate fairness policy raises aggregate throughput by 68.22 percent relative to the strict case.
Utility-based policy raises overall utility by 177.6 percent relative to absolute fairness.
Operators can select any of the three operating points by swapping the reward function and training the DQN under it, since each policy is learned from its own reward.
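The percentages above can be read as relative gains over the absolute-fairness policy. The symbols below are introduced here for illustration, and the reading assumes both figures use absolute fairness as the baseline, which the abstract's wording suggests but does not state outright.

```latex
\[
\text{gain}(\%) \;=\; 100 \times \frac{X_{\text{policy}} - X_{\text{abs}}}{X_{\text{abs}}}
\]
\[
\frac{T_{\text{moderate}}}{T_{\text{abs}}} \;=\; 1 + 0.6822 \;=\; 1.6822,
\qquad
\frac{U_{\text{utility}}}{U_{\text{abs}}} \;=\; 1 + 1.776 \;=\; 2.776 .
\]
```

On that reading, the moderate policy carries about 1.68 times the aggregate throughput of the strict policy, and the utility-based policy reaches about 2.78 times its utility.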

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the MDP assumption holds in hardware, the same reward-design method could be applied to other heterogeneous unlicensed-band problems such as future 6G sharing scenarios.
Policy generalization would need explicit tests under traffic loads and interference levels absent from the original simulations.
Separating policy specification from the learning loop may let the approach combine with multi-agent methods when several NR-U and Wi-Fi nodes interact simultaneously.

Load-bearing premise

The real coexistence dynamics between NR-U and Wi-Fi can be faithfully captured as a Markov decision process whose state and reward signals let a DQN learn stable policies that generalize past the simulated training scenarios.

What would settle it

Deploy the trained DQN policies in a new simulation that uses Wi-Fi traffic patterns or node densities outside the original training distribution and check whether the Jain fairness index falls below 0.9 or the reported throughput and utility gains vanish.
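The check described above can be sketched as a small helper that scores each evaluation scenario and lists those that fall under the fairness line. The function names and the dictionary layout (scenario name mapped to per-technology throughputs) are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence


def jain_index(x: Sequence[float]) -> float:
    """Jain fairness index: (sum x)^2 / (n * sum x^2)."""
    if len(x) == 0:
        raise ValueError("need at least one allocation")
    sq = sum(v * v for v in x)
    if sq == 0:
        return 1.0  # convention: an all-zero allocation counts as equal shares
    total = sum(x)
    return (total * total) / (len(x) * sq)


def scenarios_below_threshold(results: Dict[str, Sequence[float]],
                              threshold: float = 0.9) -> List[str]:
    """Return the names of scenarios whose Jain index is below the threshold."""
    return sorted(name for name, tput in results.items()
                  if jain_index(tput) < threshold)
```

A non-empty return value for scenarios outside the training distribution would be the kind of evidence that counts against the generalization premise.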

Figures

Figures reproduced from arXiv: 2605.00457 by Chiapin Wang, Po-Heng Chou, Shou-Yu Chen, Yi-Fang Yu.

**Figure 1.** Markov chain representation of Wi-Fi

**Figure 2.** Markov chain representation of NR-U LBT

**Figure 3.** System-level DRL-based TXOP control framework for NR-U/Wi-Fi coexistence, where the NR-U TXOP

**Figure 4.** Convergence behavior of different learning schemes in terms of training steps over fixed episodes (

**Figure 5.** Throughput of NR-U, Wi-Fi under the default

**Figure 6.** Throughput of NR-U and Wi-Fi under the pro

**Figure 9.** Throughput of NR-U and Wi-Fi under MAB [23]

**Figure 11.** Average utility comparison among LBT, Q1,

**Figure 12.** Throughput fairness comparison among LBT,

**Figure 13.** Utility fairness comparison among LBT, Q1,

Original abstract

The coexistence of NR-U and Wi-Fi in unlicensed spectrum introduces a challenging coexistence management problem, where heterogeneous channel access mechanisms lead to a significant imbalance in spectrum utilization and degraded Wi-Fi performance. To address this challenge, we propose a policy-driven deep reinforcement learning (DRL) framework for adaptive transmission opportunity (TXOP) control, in which the coexistence process is formulated as a Markov decision process (MDP) and a deep Q-network (DQN) learns control policies through online interaction. A key contribution is the introduction of a policy layer via reward design, enabling explicit control of coexistence tradeoffs among fairness, throughput, and utility. Three policies, namely absolute fairness, moderate fairness, and utility-based fairness, are developed to achieve different operating points. Simulation results show that the proposed framework achieves a Jain fairness index above 0.9 under strict fairness control. Compared to absolute fairness, moderate fairness improves aggregate throughput by 68.22%, while the utility-based policy further enhances utility by 177.6%. These results demonstrate that policy-driven control provides a flexible and effective solution for managing tradeoffs in heterogeneous coexistence networks.
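As a reminder of what a DQN learning control policies over an MDP involves, the textbook DQN objective is written out below. The abstract does not say whether the paper uses a target network, experience replay, or any DQN variant, so the symbols here (discount factor, online parameters, target parameters) describe the standard formulation and not necessarily the authors' implementation.

```latex
% s_t: observed coexistence state, a_t: chosen TXOP setting, r_t: policy-specific reward
\[
y_t \;=\; r_t + \gamma \max_{a'} Q\!\left(s_{t+1}, a'; \theta^{-}\right),
\qquad
\mathcal{L}(\theta) \;=\; \mathbb{E}\!\left[\big(y_t - Q(s_t, a_t; \theta)\big)^{2}\right].
\]
```

Only the reward term changes between the three policies; the state, the action set, and the update rule keep the same form.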

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Standard DQN with three reward-shaped policies lets operators tune fairness versus throughput in NR-U/Wi-Fi TXOP control, but the gains rest on simulations whose robustness outside the training conditions is unclear.

The paper shows how to use reward design inside a DQN to hit different operating points for fairness, aggregate throughput, and utility in NR-U and Wi-Fi coexistence. The authors set up the channel access problem as an MDP and train policies for absolute fairness, moderate fairness, and a utility-based variant. The simulations report a Jain index above 0.9 under strict control, a 68.22 percent throughput lift for the moderate policy, and a 177.6 percent utility gain for the utility-based policy. That gives a direct demonstration of the tradeoff knob they built.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a policy-driven deep reinforcement learning (DRL) framework for adaptive TXOP control in NR-U/Wi-Fi coexistence. The problem is formulated as a Markov decision process (MDP) and solved with a deep Q-network (DQN), with a policy layer implemented through reward design to explicitly control tradeoffs among fairness, throughput, and QoS. Three policies (absolute fairness, moderate fairness, utility-based fairness) are developed and evaluated in simulation, yielding a Jain fairness index above 0.9 under strict control, a 68.22% aggregate throughput gain for moderate fairness versus absolute fairness, and a 177.6% utility improvement for the utility-based policy.

Significance. If the results hold under broader conditions, the work demonstrates a practical mechanism for system-level tradeoff management in heterogeneous unlicensed-spectrum networks by using reward engineering to steer DQN policies toward different operating points. This is potentially significant for NR-U deployments, as it offers more flexibility than fixed-parameter listen-before-talk mechanisms. The concrete simulation metrics and explicit policy definitions via rewards are strengths that could inform future coexistence designs, though the absence of baselines and generalization tests limits immediate applicability.

major comments (3)

[Simulation Results] Simulation Results section: the reported gains (68.22% throughput, 177.6% utility) and Jain index >0.9 are presented without any baseline comparisons (e.g., to standard NR-U LBT or prior DRL coexistence schemes), statistical error bars, or details on the number of independent runs and random seeds. This makes it impossible to judge whether the numerical improvements are robust or merely artifacts of the chosen scenarios.
[MDP Formulation] MDP Formulation and State Representation: the state is described only at a high level (channel occupancy, queue lengths, interference metrics) without specifying whether it includes sufficient history to satisfy the Markov property under partial observability (hidden terminals, time-varying fading, asynchronous arrivals). Because the central claim rests on DQN learning stable policies that achieve the reported fairness/throughput points, this omission is load-bearing.
[Evaluation] Evaluation and Generalization: all results are confined to the training distribution (fixed node count, traffic load, channel model). No out-of-distribution tests (varying node density, bursty traffic, or different fading parameters) are reported, which directly undermines the claim that the policy-driven framework provides reliable tradeoff control beyond the simulated conditions.
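The first major comment asks for run counts and error bars. One minimal way to report a metric across independent seeds is sketched below with the Python standard library; the helper names are hypothetical and the code is not taken from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_std(runs: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and sample standard deviation across independent runs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(runs), stdev(runs)


def format_with_error(runs: Sequence[float], digits: int = 2) -> str:
    """Render a metric as 'mean +/- std (n=runs)' for a results table."""
    m, s = mean_and_std(runs)
    return f"{m:.{digits}f} +/- {s:.{digits}f} (n={len(runs)})"
```

Reporting the run count next to the spread lets a reader judge whether a gap between two policies exceeds seed-to-seed variation.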

minor comments (2)

[Abstract] Abstract and §1: the phrase 'online interaction' should be clarified; the text appears to describe simulation-based training rather than live network deployment.
[Policy Design] Reward definitions: the three policies are introduced via reward design, but the exact functional forms and weighting parameters should be given explicitly (perhaps in a table) so that the 'parameter-free' aspects of the tradeoff control can be verified.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments on our work. We address each of the major comments point by point below, outlining the revisions we plan to incorporate in the updated manuscript.

Referee: [Simulation Results] Simulation Results section: the reported gains (68.22% throughput, 177.6% utility) and Jain index >0.9 are presented without any baseline comparisons (e.g., to standard NR-U LBT or prior DRL coexistence schemes), statistical error bars, or details on the number of independent runs and random seeds. This makes it impossible to judge whether the numerical improvements are robust or merely artifacts of the chosen scenarios.

Authors: We agree that the absence of baseline comparisons and statistical details in the Simulation Results section limits the assessment of robustness. In the revised manuscript, we will add comparisons to the standard NR-U LBT mechanism as well as to existing DRL-based coexistence schemes from the literature. We will also specify that all results are averaged over 10 independent runs with different random seeds and include error bars representing the standard deviation for the key metrics such as throughput, utility, and Jain fairness index. revision: yes
Referee: [MDP Formulation] MDP Formulation and State Representation: the state is described only at a high level (channel occupancy, queue lengths, interference metrics) without specifying whether it includes sufficient history to satisfy the Markov property under partial observability (hidden terminals, time-varying fading, asynchronous arrivals). Because the central claim rests on DQN learning stable policies that achieve the reported fairness/throughput points, this omission is load-bearing.

Authors: The manuscript provides a high-level description of the state. To ensure the Markov property is adequately addressed, we will revise the MDP Formulation section to include a more detailed specification of the state vector. This will explicitly state the inclusion of historical channel occupancy over the past few time slots to account for partial observability due to hidden terminals, time-varying fading, and asynchronous packet arrivals. We maintain that the current state design enables the DQN to learn stable policies, but the expanded description will clarify how it satisfies the necessary conditions for the MDP. revision: yes
Referee: [Evaluation] Evaluation and Generalization: all results are confined to the training distribution (fixed node count, traffic load, channel model). No out-of-distribution tests (varying node density, bursty traffic, or different fading parameters) are reported, which directly undermines the claim that the policy-driven framework provides reliable tradeoff control beyond the simulated conditions.

Authors: We acknowledge that the current evaluation is restricted to the training scenarios. In the revised version, we will include additional experiments for out-of-distribution generalization. Specifically, we will test the trained policies under varying node densities, bursty traffic loads, and different channel fading parameters to demonstrate the robustness of the policy-driven tradeoff control. These results will be presented to support the broader applicability of the framework. revision: yes
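The second response proposes adding recent channel-occupancy history to the state. A common way to do that is to stack the last few observations into one vector, sketched below. The class name, the zero padding before the buffer fills, and the flat feature layout are assumptions for illustration, not details from the manuscript.

```python
from collections import deque
from typing import Deque, List, Sequence


class HistoryState:
    """Keep the last `depth` observations and expose them as one flat state vector."""

    def __init__(self, depth: int, features: int) -> None:
        if depth < 1 or features < 1:
            raise ValueError("depth and features must be positive")
        self.features = features
        # Zero-padded until `depth` real observations have arrived (an assumption).
        self._buf: Deque[List[float]] = deque(
            ([0.0] * features for _ in range(depth)), maxlen=depth
        )

    def push(self, obs: Sequence[float]) -> None:
        """Append the newest observation, dropping the oldest one."""
        if len(obs) != self.features:
            raise ValueError("observation has the wrong number of features")
        self._buf.append(list(obs))

    def vector(self) -> List[float]:
        """Oldest-to-newest concatenation, suitable as a DQN input."""
        return [v for obs in self._buf for v in obs]
```

Stacking history enlarges the DQN input and does not by itself guarantee the Markov property, so the depth would still need to be justified empirically.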

Circularity Check

0 steps flagged

No circularity: simulation results from explicitly designed rewards

The paper formulates NR-U/Wi-Fi coexistence as an MDP and trains a DQN agent whose policies are shaped by three author-specified reward functions (absolute fairness, moderate fairness, utility-based). The reported metrics (Jain index >0.9, 68.22% throughput gain, 177.6% utility gain) are direct simulation outputs under those rewards. No equation reduces a claimed prediction to a fitted parameter by construction, no load-bearing self-citation is invoked, and no uniqueness theorem or ansatz is smuggled in. The derivation chain is therefore self-contained: the framework proposes a control method and validates it empirically in simulation without definitional looping.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that an MDP formulation plus reward engineering suffices to control system-level metrics; no new physical entities or unproven mathematical axioms are introduced.

free parameters (1)

DQN hyperparameters and reward weights
Learning rate, discount factor, and the three policy-specific reward coefficients are chosen to produce the reported operating points.

axioms (1)

Domain assumption: Coexistence dynamics admit a Markovian state representation sufficient for stable policy learning
Invoked when the problem is cast as an MDP without proving that partial observability or non-stationarity is negligible.
