Deep Reinforcement Learning for Cognitive Time-Division Joint SAR and Secure Communications

Anke Schmeink; Ata Khalili; Mohamed-Amine Lahmeri; Robert Schober; Yujiao Liu

arXiv: 2604.09978 · v1 · submitted 2026-04-11 · cs.IT · cs.SY · eess.SY · math.IT

Mohamed-Amine Lahmeri, Ata Khalili, Yujiao Liu, Anke Schmeink, Robert Schober

Pith reviewed 2026-05-10 16:37 UTC · model grok-4.3

classification cs.IT · cs.SY · eess.SY · math.IT

keywords joint SAR and communications · secure communications · deep reinforcement learning · time division multiplexing · eavesdropper tracking · secrecy rate optimization · along-track interferometry · aerial base station

The pith

Deep reinforcement learning optimizes time and power allocation in a time-division joint SAR and secure communication system to maximize the worst-case secrecy rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a dynamic time-division joint SAR and communication framework where an aerial base station uses cognitive SAR along-track interferometry to track a moving eavesdropper and then applies adaptive beamforming plus artificial noise to secure downlink transmissions to a ground user. It casts the joint optimization of sensing and communication time slots together with power allocation as a Markov decision process and solves it with deep reinforcement learning. This setup matters for critical aerial scenarios such as surveillance or disaster response, where conventional localization of adversaries fails and fixed time splits cannot guarantee both imaging quality and secrecy. Simulations show the learned policy outperforms equal-aperture and random-allocation baselines while satisfying SAR and rate constraints. The same policy also performs well on eavesdropper motion patterns absent from training.

Core claim

The central claim is that a cognitive time-division joint SAR and communication system, in which an aerial base station estimates eavesdropper position and velocity via along-track interferometry, formulates the resulting worst-case secrecy-rate maximization as a Markov decision process, and solves it by deep reinforcement learning, yields higher secrecy rates than fixed or random time-allocation schemes while meeting both SAR imaging and communication constraints.
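
For orientation, a worst-case secrecy rate of this kind is commonly written in the generic form below. This is the standard textbook shape, not the paper's own expression, which this page does not reproduce. The symbols $R_{\mathrm{UE}}$ (user rate), $R_{\mathrm{E}}$ (eavesdropper rate), $\mathcal{U}$ (eavesdropper uncertainty region), $\hat{\mathbf{q}}$ (estimated position) and $r$ (uncertainty radius) are notation assumed here for illustration.

```latex
R_{\mathrm{sec}}^{\mathrm{wc}}
  = \left[ R_{\mathrm{UE}} - \max_{\mathbf{q} \in \mathcal{U}} R_{\mathrm{E}}(\mathbf{q}) \right]^{+},
\qquad
\mathcal{U} = \left\{ \mathbf{q} : \lVert \mathbf{q} - \hat{\mathbf{q}} \rVert \le r \right\},
\qquad
[x]^{+} = \max(x, 0).
```

Read this way, better sensing helps secrecy by shrinking $\mathcal{U}$, while time spent sensing is time not spent transmitting, which is the trade-off the time allocation has to resolve.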

What carries the argument

The Markov decision process whose state captures the eavesdropper trajectory estimated by cognitive SAR along-track interferometry and whose actions are the time and power allocations between SAR and secure communication phases.
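
To make the state/action structure concrete, here is a deliberately small toy environment. It is a sketch under stated assumptions, not the paper's model: the one-dimensional geometry, the link-gain constant, the rule by which sensing shrinks the uncertainty radius, and every name (`ToyTdJsarcEnv`, `EveEstimate`, `tau`, `rho`) are hypothetical choices made only to show how an estimate-based state, a time/power action, and a worst-case secrecy reward fit together.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class EveEstimate:
    """Hypothetical MDP state: the eavesdropper estimate carried between frames."""
    x: float       # estimated ground position relative to the ABS [m]
    v: float       # estimated velocity [m/s]
    radius: float  # uncertainty radius around the estimate [m]


class ToyTdJsarcEnv:
    """Toy time-division sensing/communication MDP (illustrative only)."""

    def __init__(self, frame_s=1.0, height_m=100.0, ue_x=0.0, seed=0):
        self.frame_s = frame_s
        self.height_m = height_m
        self.ue_x = ue_x
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        self.state = EveEstimate(x=200.0, v=self.rng.uniform(-10.0, 10.0), radius=20.0)
        return self.state

    def _gain(self, ground_dist_m):
        # Assumed free-space-like gain that decays with squared slant range.
        return 1e6 / (self.height_m ** 2 + ground_dist_m ** 2)

    def step(self, tau, rho):
        """tau: fraction of the frame spent sensing; rho: power share for data."""
        if self.state is None:
            raise RuntimeError("call reset() before step()")
        if not (0.0 <= tau <= 1.0 and 0.0 <= rho <= 1.0):
            raise ValueError("tau and rho must lie in [0, 1]")
        s = self.state
        x = s.x + s.v * self.frame_s
        # Assumed dynamics: sensing shrinks the uncertainty, unobserved motion grows it.
        radius = max(1.0, s.radius * (1.0 - 0.8 * tau)
                     + abs(s.v) * self.frame_s * (1.0 - tau))
        # Worst case: eavesdropper at the point of its region closest to the ABS.
        g_eve = self._gain(max(0.0, abs(x) - radius))
        g_ue = self._gain(abs(self.ue_x))
        rate_ue = math.log2(1.0 + g_ue * rho)
        # Artificial noise (power share 1 - rho) is assumed to hurt only the eavesdropper.
        rate_eve = math.log2(1.0 + g_eve * rho / (1.0 + g_eve * (1.0 - rho)))
        reward = (1.0 - tau) * max(0.0, rate_ue - rate_eve)
        self.state = EveEstimate(x=x, v=s.v, radius=radius)
        return self.state, reward
```

After `env = ToyTdJsarcEnv()` and `env.reset()`, a call such as `env.step(0.3, 0.5)` returns the next estimate and a non-negative reward. A DRL agent would pick `(tau, rho)` from the current state in each frame instead of fixing them.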

If this is right

The learned policy achieves higher worst-case secrecy rates than both learning and non-learning baselines that use equal-aperture or random time allocation.
The same policy generalizes to previously unseen eavesdropper motion patterns without retraining.
Joint optimization of time and power satisfies the SAR imaging quality and communication rate constraints simultaneously.
Adaptive beamforming and artificial-noise jamming, driven by the SAR-derived estimates, improve secrecy against a ground-moving eavesdropper.
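
The equal-aperture and random time allocations used by the baselines can be pictured as simple per-frame time splits. The sketch below is one reading of them, not the paper's definition: it assumes "equal-aperture" means the same sensing duration in every frame (hence the same synthetic aperture at constant platform speed), and the function names, the `sensing_fraction` parameter and the uniform sampling are hypothetical.

```python
import random


def equal_aperture_allocation(n_frames, frame_s, sensing_fraction=0.5):
    """Same (sensing, communication) time split in every frame."""
    sensing = frame_s * sensing_fraction
    return [(sensing, frame_s - sensing) for _ in range(n_frames)]


def random_allocation(n_frames, frame_s, rng):
    """Sensing time drawn uniformly per frame; the remainder goes to communication."""
    split = []
    for _ in range(n_frames):
        sensing = rng.uniform(0.0, frame_s)
        split.append((sensing, frame_s - sensing))
    return split


fixed = equal_aperture_allocation(n_frames=4, frame_s=1.0)
rand = random_allocation(n_frames=4, frame_s=1.0, rng=random.Random(0))
```

Neither split reacts to the eavesdropper estimate, which is the property a learned policy is supposed to improve on.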

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could scale to multiple simultaneous users or eavesdroppers by expanding the MDP state and action spaces, provided training remains tractable.
Because the method relies on SAR-derived motion estimates, combining it with complementary sensors such as optical or inertial data could reduce sensitivity to SAR-specific error sources.
If the DRL agent is trained only in simulation, real-world transfer would require calibration of the channel and motion models to close the domain gap.

Load-bearing premise

Cognitive SAR along-track interferometry produces position and velocity estimates of the eavesdropper that are accurate enough for adaptive beamforming and artificial-noise jamming to deliver the claimed secrecy gains.

What would settle it

A test in which realistic SAR estimation errors cause the learned policy to produce lower secrecy rates than the equal-aperture baseline, or in which the policy fails to generalize to new eavesdropper trajectories, would falsify the performance claims.
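
A minimal Monte Carlo sketch of such a test follows, under a toy model that is not the paper's. It assumes the artificial noise degrades the eavesdropper only when the true position lies within the protected radius around the estimate, and that the estimation error is zero-mean Gaussian. The rates, gain, radius and function names are all hypothetical placeholders.

```python
import math
import random


def secrecy_rate(covered, rate_ue=5.0, g_eve=20.0, rho=0.5):
    """Toy secrecy rate [bit/s/Hz]; artificial noise helps only if it covers Eve."""
    noise = g_eve * (1.0 - rho) if covered else 0.0
    rate_eve = math.log2(1.0 + g_eve * rho / (1.0 + noise))
    return max(0.0, rate_ue - rate_eve)


def mean_secrecy(sigma_m, radius_m=10.0, trials=2000, seed=0):
    """Average secrecy rate when the position estimate has std sigma_m [m]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        error = rng.gauss(0.0, sigma_m)  # estimate minus true position
        total += secrecy_rate(covered=abs(error) <= radius_m)
    return total / trials


# Sweep the assumed error level to see where the secrecy rate starts to drop.
sweep = {sigma: mean_secrecy(sigma) for sigma in (0.0, 5.0, 10.0, 50.0)}
```

In this toy setting the average rate can only fall as `sigma_m` grows, because every missed coverage event replaces the jammed eavesdropper rate with the larger unjammed one. The informative quantity is the error level at which a fixed baseline would overtake the adaptive scheme.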

Figures

Figures reproduced from arXiv: 2604.09978 by Anke Schmeink, Ata Khalili, Mohamed-Amine Lahmeri, Robert Schober, Yujiao Liu.

**Figure 2.** Proposed dynamic TD JSARC framework. For illustration, only selected time frames (first, second, and …

**Figure 3.** Top-view illustration of the evolution of the eavesdropper uncertainty region with radius …

**Figure 4.** Learned policy adaptation to unseen linear eavesdropper trajectory with oscillating speed (between …

**Figure 5.** Average worst-case secrecy rate versus eavesdropper speed. The eavesdropper moves along a circular …

Original abstract

Synthetic aperture radar (SAR) imaging can be exploited to enhance wireless communication performance through high-precision environmental awareness. However, integrating sensing and communication functionalities in such wideband systems remains challenging, motivating the development of a joint SAR and communication (JSARC) framework. We propose a dynamic time-division JSARC (TD-JSARC) framework for secure aerial communications that is relevant for critical scenarios, such as surveillance or post-disaster communication, where conventional localization of mobile adversaries often fails. In particular, we consider a secure downlink communication scenario where an aerial base station (ABS) serves a ground user (UE) in the presence of a ground-moving eavesdropper. To detect and track the eavesdropper, the ABS uses cognitive SAR along-track interferometry (ATI) to estimate its position and velocity. Based on these estimates, the ABS applies adaptive beamforming and artificial-noise jamming to enhance secrecy. To this end, we jointly optimize the time and power allocation to maximize the worst-case secrecy rate, while satisfying both SAR and communication constraints. Using the estimated eavesdropper trajectory, we formulate the problem as a Markov decision process (MDP) and solve it via deep reinforcement learning (DRL). Simulation results show that the proposed learning-based approach outperforms both learning and non-learning baseline schemes employing equal-aperture and random time allocation. The proposed method also generalizes well to previously unseen eavesdropper motion patterns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper combines cognitive SAR ATI tracking with DRL for time-power allocation in secure aerial comms and reports simulation gains over baselines, but the secrecy results rest on unexamined accuracy of the sensing estimates.

The main point here is a DRL solver for dynamic time and power splits in a time-division joint SAR and secure communication setup. An aerial base station uses cognitive SAR along-track interferometry to track a ground eavesdropper in real time, then applies adaptive beamforming and artificial noise based on those estimates while maximizing worst-case secrecy rate subject to SAR imaging and communication constraints. The MDP is built directly from the physical model and solved with DRL, with simulations showing better performance than equal-aperture and random allocation schemes plus some generalization to unseen eavesdropper trajectories.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a time-division joint SAR and secure communications (TD-JSARC) framework for an aerial base station serving a ground user in the presence of a ground-moving eavesdropper. Cognitive SAR along-track interferometry (ATI) is used to estimate the eavesdropper's position and velocity, which inform adaptive beamforming and artificial-noise jamming. The joint time and power allocation problem is formulated to maximize the worst-case secrecy rate subject to SAR imaging and communication constraints, cast as a Markov decision process (MDP), and solved via deep reinforcement learning (DRL). Simulation results are presented claiming that the DRL approach outperforms both learning and non-learning baselines using equal-aperture and random time allocation, and that the learned policy generalizes to previously unseen eavesdropper motion patterns.

Significance. If the results hold under realistic sensing conditions, the work advances integrated sensing and communications (ISAC) for physical-layer security in dynamic aerial scenarios by coupling SAR-based adversary tracking with DRL-driven resource allocation. Credit is given for the reported generalization to unseen motion patterns, which provides evidence that the policy captures transferable structure rather than memorizing specific trajectories. The simulation-based outperformance over multiple baselines is a concrete strength, though its weight depends on validation of the underlying ATI accuracy assumption.

major comments (2)

[MDP formulation and simulation setup] The MDP formulation (described in the system model and problem formulation sections) incorporates ATI-derived eavesdropper position and velocity estimates directly into the state for adaptive beamforming and AN jamming to achieve the worst-case secrecy rate. No error model is introduced for ATI inaccuracies (e.g., clutter, phase noise, or along-track velocity ambiguity), and no sensitivity analysis quantifies how secrecy rate and policy performance degrade under realistic estimation errors. This assumption is load-bearing for both the outperformance and generalization claims, as optimistic sensing inputs would render the reported gains artifacts of the simulation channel model rather than robust outcomes of the joint design.
[Simulation results] The simulation results (abstract and results section) report outperformance and generalization but omit key details required for assessment: the specific DRL architecture (e.g., actor-critic network type, layer sizes), full constraint formulations in the MDP reward and state transitions, hyperparameter values beyond learning rate and discount factor, and statistical validation (e.g., number of independent runs, confidence intervals, or variance across random seeds). Without these, it is not possible to determine whether the gains are reproducible or sensitive to simulation artifacts.

minor comments (2)

[System model] Notation for time-division parameters and secrecy rate expressions could be introduced more explicitly with a dedicated table of symbols to improve readability.
[Results] Figure captions in the results section should specify the exact simulation parameters (e.g., SNR ranges, eavesdropper velocity distributions) used for each curve to allow direct comparison with the baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment point by point below, providing clarifications on the current manuscript and committing to specific revisions that strengthen the work without misrepresenting our contributions.

Referee: [MDP formulation and simulation setup] The MDP formulation incorporates ATI-derived eavesdropper position and velocity estimates directly into the state without an error model for ATI inaccuracies (e.g., clutter, phase noise, or along-track velocity ambiguity), and no sensitivity analysis quantifies how secrecy rate and policy performance degrade under realistic estimation errors. This assumption is load-bearing for both the outperformance and generalization claims.

Authors: We agree that robustness to ATI estimation errors is important for the claims. The current manuscript assumes ideal estimates to isolate the performance gains from the joint time/power optimization and DRL policy under perfect sensing, which is a standard initial approach in ISAC studies. However, we will revise the manuscript to include a dedicated sensitivity analysis subsection. This will model realistic ATI errors (e.g., additive Gaussian noise on position/velocity estimates with varying variances) and quantify degradation in worst-case secrecy rate, along with how the learned policy performs under noisy states. We will also add discussion on potential robustness enhancements, such as training the DRL agent with noisy observations.
revision: yes
Referee: [Simulation results] The simulation results omit key details required for assessment: the specific DRL architecture (e.g., actor-critic network type, layer sizes), full constraint formulations in the MDP reward and state transitions, hyperparameter values beyond learning rate and discount factor, and statistical validation (e.g., number of independent runs, confidence intervals, or variance across random seeds).

Authors: We thank the referee for highlighting these reproducibility issues. The original submission emphasized high-level performance comparisons, but we will expand the simulation setup and results sections in the revised version. We will specify the exact DRL architecture (actor-critic networks with layer sizes and activations), provide the complete MDP reward function and all state transition details including constraints, list all hyperparameters (including batch size, exploration parameters, etc.), and report statistical validation with averages over 10 independent runs, including standard deviations and confidence intervals. These additions will enable full assessment and reproduction.
revision: yes
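
The statistical validation promised in the second response (mean, standard deviation and confidence interval over 10 independent runs) could be computed along the lines below. This is a sketch, not the authors' procedure: it assumes a two-sided 95% Student-t interval, and the input values are arbitrary placeholders, not results from the paper.

```python
import math
import statistics

# Two-sided 95% Student-t quantile for 9 degrees of freedom (10 runs).
T_CRIT_95_DOF9 = 2.262


def summarize_runs(rates, t_crit=T_CRIT_95_DOF9):
    """Return mean, sample std, and a t-based confidence interval."""
    mean = statistics.mean(rates)
    std = statistics.stdev(rates)  # sample standard deviation (n - 1 denominator)
    half_width = t_crit * std / math.sqrt(len(rates))
    return mean, std, (mean - half_width, mean + half_width)


# Placeholder per-seed values for illustration only (NOT data from the paper).
placeholder_rates = [float(i) for i in range(1, 11)]
mean, std, ci = summarize_runs(placeholder_rates)
```

The default `t_crit` is only valid for exactly 10 runs; a different number of seeds needs the quantile for the matching degrees of freedom.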

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper derives the MDP directly from the physical JSARC system model (ATI-based eavesdropper estimation, adaptive beamforming, AN jamming, SAR/comms constraints) and applies standard DRL to solve the time/power allocation for worst-case secrecy rate. Simulation-based comparisons to equal-aperture and random baselines, plus generalization tests on unseen motion patterns, are external evaluations against the same model; they do not reduce by construction to a fitted quantity or self-defined input. No self-definitional equations, no fitted parameters renamed as predictions, and no load-bearing self-citations appear in the abstract or described chain. The approach is a conventional model-based RL formulation that remains self-contained against its simulation benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper rests on standard wireless channel and SAR processing assumptions plus the MDP formulation for DRL. No new physical entities are introduced. Free parameters are typical DRL training choices not detailed in the abstract.

free parameters (1)

DRL hyperparameters including learning rate and discount factor
Standard tunable parameters required to train the agent on the formulated MDP.
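
For readers less familiar with these two parameters, their textbook roles are shown below in generic notation; this is the standard formulation, not something specific to the paper. The discount factor $\gamma$ weights future rewards in the return $G_t$, and the learning rate $\alpha$ scales each gradient step on the policy parameters $\theta$ toward a higher objective $J(\theta)$.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}, \quad 0 \le \gamma < 1,
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_{\theta} J(\theta).
```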

axioms (1)

domain assumption: The joint SAR sensing and secure communication dynamics can be accurately modeled as a Markov decision process with states based on eavesdropper estimates.
Invoked to enable application of DRL to the time and power allocation optimization.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 1 internal anchor

[1] Y. Wu et al., “A survey of physical layer security techniques for 5G wireless networks and challenges ahead,” IEEE J. Sel. Areas Commun., vol. 36, no. 4, pp. 679–695, 2018.
[2] X. Sun et al., “Physical layer security in UAV systems: Challenges and opportunities,” IEEE Wireless Commun., vol. 26, no. 5, pp. 40–47, 2019.
[3] G. Zhang, Q. Wu, M. Cui, and R. Zhang, “Securing UAV communications via joint trajectory and power control,” IEEE Trans. Wireless Commun., vol. 18, no. 2, pp. 1376–1389, 2019.
[4] A. V. Savkin, H. Huang, and W. Ni, “Securing UAV communication in the presence of stationary or mobile eavesdroppers via online 3D trajectory planning,” IEEE Wireless Commun. Lett., vol. 9, no. 8, pp. 1211–1215, 2020.
[5] Y. Cai et al., “Joint trajectory and resource allocation design for energy-efficient secure UAV communication systems,” IEEE Trans. Commun., vol. 68, no. 7, pp. 4536–4553, 2020.
[6] M. I. Skolnik, Radar Handbook, 3rd ed. McGraw-Hill, 2008.
[7] M.-A. Lahmeri et al., “UAV formation and resource allocation optimization for communication-assisted 3D InSAR sensing,” IEEE Trans. Commun., vol. 73, no. 8, pp. 5788–5804, 2025.
[8] M.-A. Lahmeri et al., “Sensing accuracy optimization for communication-assisted dual-baseline UAV-InSAR,” in Proc. IEEE Int. Conf. Commun., 2025, pp. 6573–6578.
[9] S. Hu, X. Yuan, W. Ni, and X. Wang, “Trajectory planning of cellular-connected UAV for communication-assisted radar sensing,” IEEE Trans. Commun., vol. 70, no. 9, pp. 6385–6396, 2022.
[10] Z. Liu, F. Zesong et al., “Integrated sensing and communication for UAV-borne SAR systems,” in Int. Symp. Commun. Inf. Technol., 2023, pp. 1–6.
[11] Z. Liu et al., “Joint user scheduling, power allocation, and trajectory design for joint SAR and communication UAV systems,” IEEE Trans. Veh. Technol., vol. 74, no. 2, pp. 3006–3016, 2025.
[12] S. Moro et al., “Exploring ISAC technology for UAV SAR imaging,” in IEEE Int. Conf. Commun., 2024, pp. 1582–1587.
[13] S. Haykin, “Cognitive radar: a way of the future,” IEEE Signal Process. Mag., vol. 23, no. 1, pp. 30–40, 2006.
[14] A. Moreira et al., “A tutorial on synthetic aperture radar,” IEEE Geosci. Remote Sens. Mag., vol. 1, no. 1, pp. 6–43, 2013.
[15] D. Xu et al., “Robust and Secure Resource Allocation for ISAC Systems: A Novel Optimization Framework for Variable-Length Snapshots,” IEEE Trans. Commun., vol. 70, no. 12, pp. 8196–8214, 2022.
[16] N. Cao, Y. Chen, X. Gu, and W. Feng, “Joint bi-static radar and communications designs for intelligent transportation,” IEEE Trans. Veh. Technol., vol. 69, no. 11, pp. 13 060–13 071, 2020.
[17] Z. Xie et al., “Optimal Scheduling Policy for Time-Division Joint Radar and Communication Systems: Cross-Layer Design and Sensing for Free,” IEEE Internet Things J., vol. 10, no. 23, pp. 20 746–20 760, 2023.
[18] V. Pascazio, G. Schirinzi, and A. Farina, “Moving target detection by along-track interferometry,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., vol. 7, 2001, pp. 3024–3026.
[19] C. W. Chen, “Performance assessment of along-track interferometry for detecting ground moving targets,” in Proc. IEEE Radar Conf., 2004, pp. 99–104.
[20] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
[21] J. Schulman et al., “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[22] E. Liang et al., “RLlib: Abstractions for distributed reinforcement learning,” in Proc. 35th Int. Conf. Mach. Learn. (ICML), vol. 80. PMLR, 2018, pp. 3053–3062.
