Learning to Cope with Adversarial Attacks

Aaron Havens; Girish Chowdhary; Soumik Sarkar; Xian Yeow Lee

arXiv: 1906.12061 · v1 · pith:GLGF2AN5new · submitted 2019-06-28 · 💻 cs.LG · cs.CR · stat.ML

Xian Yeow Lee, Aaron Havens, Girish Chowdhary, Soumik Sarkar

Pith reviewed 2026-05-25 13:48 UTC · model grok-4.3

classification 💻 cs.LG · cs.CR · stat.ML

keywords deep reinforcement learning · adversarial attacks · meta-learning · hierarchical policies · robustness · online adaptation · cyber-physical systems

The pith

A meta-learned hierarchical RL agent adapts its master and sub-policies online to sustain nominal rewards under adversarial attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the Meta-Learned Advantage Hierarchy agent as a way for deep reinforcement learning systems to handle adversarial intrusions in real-world settings such as cyber-physical systems. It tests the agent's responses to varied attack patterns and shows that the meta-learning structure lets the agent switch between policies to keep performance steady. Results also indicate that wider gaps between attacks allow the agent to reach better overall reward levels, even as variability rises. The work focuses on whether these adaptation patterns stem directly from the hierarchical meta-learning design.

Core claim

The MLAH agent exhibits interesting coping behaviors when subjected to different adversarial attacks, allowing it to maintain a nominal reward. Additionally, the framework exhibits a hierarchical coping capability, based on the adaptability of the Master policy and of the sub-policies themselves. From empirical results, we also observed that as the interval between adversarial attacks increases, the MLAH agent can maintain a higher distribution of rewards, though at the cost of higher instability.

What carries the argument

The Meta-Learned Advantage Hierarchy (MLAH) agent, a meta-learning framework that learns robust policies online through a master policy and adaptable sub-policies.
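
As a rough illustration of the selection step, the sketch below has a Master choose among sub-policies from their recent advantage estimates. It is a toy under stated assumptions, not the paper's implementation: `ToyMaster`, `td_advantage`, the averaging window, and the pick-the-highest-mean rule are all hypothetical stand-ins for the learned Master policy.

```python
from collections import deque


def td_advantage(reward, value_s, value_next, gamma=0.99):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_s


class ToyMaster:
    """Hypothetical stand-in for the learned Master policy.

    Keeps a short window of recent advantages per sub-policy and selects
    the sub-policy whose mean recent advantage is highest.
    """

    def __init__(self, n_sub_policies, window=5):
        self.history = [deque(maxlen=window) for _ in range(n_sub_policies)]

    def observe(self, sub_policy_index, advantage):
        # Record the advantage seen while this sub-policy was acting.
        self.history[sub_policy_index].append(advantage)

    def select(self):
        # Sub-policies with no history yet are scored as 0.0 (an assumption).
        means = [sum(h) / len(h) if h else 0.0 for h in self.history]
        return max(range(len(means)), key=means.__getitem__)
```

In this toy, a sustained drop in the advantage of the active sub-policy (as an attack might cause) pushes `select()` toward the other sub-policy. The real Master is learned rather than rule-based.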

If this is right

The agent can adjust its master policy for broad responses and sub-policies for finer adjustments during attacks.
Longer intervals between attacks support higher average rewards across trials.
Increased attack spacing also raises instability in the observed reward distribution.
Online adaptation occurs without requiring offline retraining after each attack.
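
To make "interval between attacks" concrete, here is a minimal sketch of a periodic attack schedule. The schedule shape (an attack window of `duration` steps opening every `interval` steps, with no attack before the first interval has elapsed) and the name `attack_active` are assumptions for illustration; this page does not state how the paper parameterizes its schedule.

```python
def attack_active(step, interval, duration):
    """Return True while an attack window is open at this step.

    Assumed schedule: a window of `duration` steps opens at every
    multiple of `interval`, except at step 0.
    """
    if not 0 < duration <= interval:
        raise ValueError("duration must be in (0, interval]")
    return step >= interval and step % interval < duration


# Count attacked steps in a 10,000-step run with a window every 1,000 steps.
attacked_steps = sum(
    attack_active(t, interval=1000, duration=200) for t in range(10_000)
)
print(attacked_steps)  # 1800
```

Raising `interval` while holding `duration` fixed gives the agent longer nominal stretches between attacks, which is the knob the interval experiments vary.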

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar hierarchical meta-learning structures might be tested on other reinforcement learning vulnerabilities beyond the specific attacks used here.
Deployment in actual physical systems would clarify whether the simulated attack effects match real threats.
The instability trade-off could be measured against simpler non-hierarchical agents to isolate the source of the effect.

Load-bearing premise

The coping behaviors and reward patterns arise specifically from the MLAH meta-learning and hierarchical design rather than from other details of the training or attack setup.

What would settle it

Run the same attack experiments on a non-meta-learned hierarchical RL agent and compare whether the reward-maintenance and adaptation patterns disappear.
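
One way such a comparison might be organised is as a paired difference over matched seeds and attack schedules. The sketch below is only a bookkeeping helper under that assumption; `paired_reward_gap` is a hypothetical name, and the inputs would be per-seed cumulative rewards that the reader collects from their own runs.

```python
import statistics


def paired_reward_gap(candidate, baseline):
    """Mean and sample stdev of per-seed differences (candidate - baseline).

    Both lists must be ordered by the same seeds and come from runs that
    used the same attack schedule; at least two seeds are required.
    """
    if len(candidate) != len(baseline):
        raise ValueError("need one baseline run per candidate run")
    diffs = [c - b for c, b in zip(candidate, baseline)]
    return statistics.mean(diffs), statistics.stdev(diffs)
```

A mean gap near zero would suggest the baseline copes about as well as the candidate under that schedule. A clearly positive gap across seeds would be consistent with, though not proof of, an architecture-specific effect.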

Figures

Figures reproduced from arXiv: 1906.12061 by Aaron Havens, Girish Chowdhary, Soumik Sarkar, Xian Yeow Lee.

**Figure 1.** Illustration of the MLAH framework. The Master policy observes the advantages of each sub-policy and decides the optimal sub-policy to employ. The selected sub-policy then acts on the observation from the environment. Note that both the Master policy and the selected sub-policy receive the same reward signal from the environment.
… observation as the surrogate observation when adversaries are detected. Furtherm…

**Figure 2.** Illustration of a symmetric mirror attack on the RL agent about the center vertical axis. Under this attack, the optimal policy changes and the resulting action isn’t just sub-optimal but is instead directly leading the agent away from the goal.
… two sub-policies for nominal/adversary conditions. Each sub-policy consists of another separate network with 2 dense layers and 32 hidden units each. The sub-polic…
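
A minimal sketch of why a mirror attack can do more than degrade the action: it can reverse it. The setup is assumed for illustration only (an `(x, y)` grid-position observation, a grid of `width` columns, a goal on the center column, and a toy greedy policy). None of these details are taken from the paper's environment.

```python
def mirror_observation(obs, width):
    """Reflect an (x, y) grid position about the center vertical axis."""
    x, y = obs
    return (width - 1 - x, y)


def greedy_step_x(obs, goal):
    """Toy nominal policy: step along x toward the goal (-1, 0, or +1)."""
    dx = goal[0] - obs[0]
    return (dx > 0) - (dx < 0)


width = 5
goal = (2, 0)           # goal sits on the center column (assumed)
true_position = (1, 0)  # agent is one cell to the left of the goal

nominal_action = greedy_step_x(true_position, goal)
attacked_action = greedy_step_x(mirror_observation(true_position, width), goal)
print(nominal_action, attacked_action)  # 1 -1
```

Under the mirrored observation the same policy steps left instead of right, so the agent moves away from the goal until a different observation-to-action mapping is used.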

**Figure 3.** Comparison of a nominal agent with just one policy with the MLAH agent across multiple adversary attacks. The performance of the nominal agent is shown in red and the rewards clearly show a periodic presence of adversarial attacks. Performance of the MLAH agent (across different random seeds) is shown in cyan and there is a clear trend that the MLAH agent is able to cope against the adversarial attacks to main…

**Figure 4.** Illustration of MLAH agent’s coping behaviour under symmetrical mirror attack about the y-axis. The MLAH agent learns to use a different sub-policy that maps the adversarial observation to an optimal action that leads it to the goal.
… of 10,000 steps. As observed in the first sub-plot of each graph, the agent is able to reach the goal under nominal conditions in less than 10 iterations and consistently rec…

**Figure 5.** Cumulative reward plots of MLAH agent subjected to different intervals of adversarial mirror attacks. A noticeable trend is that as the intervals get smaller, the agent becomes more stable, though at a cost of a lower distribution of rewards with greater variance, as shown in …

**Figure 6.** Distribution of cumulative rewards for the MLAH agent subjected to adversarial mirror attacks with different intervals. Under long intervals of attacks, the MLAH agent has a higher distribution of rewards, albeit with several outlying points attributed to the Master agent’s instability. As attack intervals decrease, there are fewer instabilities, as evidenced by fewer outliers, although the distribution of rewar…
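
For readers who want to quantify a "higher distribution with more outliers" on their own runs, the sketch below summarizes a list of cumulative rewards and counts outliers. The 1.5 × IQR outlier rule is a common box-plot convention chosen here as an assumption. The page does not say how the paper's plots define outliers, and `summarize_rewards` is a hypothetical helper.

```python
import statistics


def summarize_rewards(rewards):
    """Summarize cumulative rewards and count 1.5 * IQR outliers."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(rewards, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [r for r in rewards if r < low or r > high]
    return {
        "mean": statistics.mean(rewards),
        "median": median,
        "stdev": statistics.stdev(rewards),
        "n_outliers": len(outliers),
    }
```

Computing this once per attack interval gives a small table in which the center of the distribution (median) and its instability (outlier count, standard deviation) can be read side by side.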

Original abstract

The security of Deep Reinforcement Learning (Deep RL) algorithms deployed in real-life applications is of primary concern. In particular, the robustness of RL agents in cyber-physical systems against adversarial attacks is especially vital, since the cost of a malevolent intrusion can be extremely high. Studies have shown that Deep Neural Networks (DNNs), which form the core decision-making unit in most modern RL algorithms, are easily subjected to adversarial attacks. Hence, it is imperative that RL agents deployed in real-life applications have the capability to detect and mitigate adversarial attacks in an online fashion. An example of such a framework is the Meta-Learned Advantage Hierarchy (MLAH) agent, which utilizes a meta-learning framework to learn policies robustly online. Since the mechanisms of this framework are still not fully explored, we conducted multiple experiments to better understand the framework's capabilities and limitations. Our results show that the MLAH agent exhibits interesting coping behaviors when subjected to different adversarial attacks to maintain a nominal reward. Additionally, the framework exhibits a hierarchical coping capability, based on the adaptability of the Master policy and sub-policies themselves. From empirical results, we also observed that as the interval of adversarial attacks increases, the MLAH agent can maintain a higher distribution of rewards, though at the cost of higher instabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper reports empirical observations of an MLAH agent adapting to adversarial attacks but does not isolate whether the meta-learning or hierarchy drives the claimed coping behaviors.

The paper extends the MLAH meta-learning framework to adversarial settings in RL for cyber-physical systems. It describes observed behaviors where the agent maintains nominal rewards under attacks, shows adaptability in the master policy and sub-policies, and achieves higher reward distributions with longer attack intervals at the expense of more instability. These are presented as new empirical results not covered in prior MLAH work.

That extension to attacks is the concrete addition here, and the focus on online robustness for safety-critical applications is a reasonable direction given the stakes in real deployments. The work is grounded in the existing MLAH literature rather than inventing new mechanisms.

The main limitation is the lack of isolating experiments. The abstract and claims tie the coping and hierarchical effects directly to the MLAH structure, yet no comparisons to flat meta-learners, non-hierarchical agents, or standard RL baselines under identical attack schedules are mentioned. Without those controls or statistical checks, the attribution remains an assumption rather than a demonstrated result, which matches the stress-test note. The scope is also narrow, limited to this one framework.

Readers working on robust RL for physical systems might pick up ideas on adaptation patterns, but the paper will not settle broader questions about adversarial ML. It deserves peer review because the topic is relevant and the observations are at least reproducible in principle if the full experiments are detailed; a referee could push for the missing baselines and clarify the causal claims.

Referee Report

1 major / 1 minor

Summary. The paper examines the Meta-Learned Advantage Hierarchy (MLAH) agent in deep reinforcement learning for robustness against adversarial attacks in cyber-physical systems. It reports that MLAH exhibits coping behaviors to sustain nominal rewards under varying attacks, demonstrates hierarchical adaptability via the master policy and sub-policies, and achieves higher reward distributions as attack intervals lengthen, at the expense of increased instability. These outcomes are presented as empirical findings from multiple experiments exploring the framework's capabilities and limitations.

Significance. If the reported behaviors can be causally attributed to the meta-learning and hierarchical structure rather than generic RL adaptation, the work could inform design of online-robust agents for high-stakes applications. However, the absence of isolating controls means the significance remains provisional; the manuscript does not yet deliver reproducible evidence that the observed coping or interval effects are architecture-specific.

major comments (1)

[Abstract / Experiments section] Abstract and experimental claims: the central attribution—that coping behaviors, hierarchical adaptability, and reward distributions arise from MLAH's meta-learning and Master/sub-policy structure—lacks supporting controls. No comparisons to flat meta-learners, non-meta hierarchical agents, or standard DQN/PPO under identical attack schedules are described, leaving the causal link to the architecture unsecured (see reader's weakest_assumption and skeptic note).

minor comments (1)

[Abstract] The abstract states results but provides no quantitative details (e.g., specific reward values, attack types, interval lengths, or statistical measures), making it impossible to assess reproducibility or effect sizes.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The major comment correctly identifies that our experiments do not include control comparisons to other agents, which limits strong causal claims about the source of the observed behaviors. We respond point by point below.

Referee: [Abstract / Experiments section] Abstract and experimental claims: the central attribution—that coping behaviors, hierarchical adaptability, and reward distributions arise from MLAH's meta-learning and Master/sub-policy structure—lacks supporting controls. No comparisons to flat meta-learners, non-meta hierarchical agents, or standard DQN/PPO under identical attack schedules are described, leaving the causal link to the architecture unsecured (see reader's weakest_assumption and skeptic note).

Authors: We agree that the manuscript contains no ablation or baseline comparisons (flat meta-learners, non-meta hierarchies, or standard DQN/PPO) under matched attack schedules. The work is an observational study of behaviors exhibited by the MLAH agent; the abstract and text attribute the reported coping and interval effects to the MLAH framework as implemented, without claiming these effects are absent in other architectures. Because new comparative experiments were not performed, we cannot supply the requested isolating controls. We will therefore revise the abstract and add an explicit limitations paragraph stating that the findings are specific to MLAH and that causal isolation of meta-learning versus hierarchy remains future work. This is a partial revision consisting of textual clarification rather than new experiments.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical observations only

The paper reports experimental results on MLAH agent behavior under adversarial attacks, with all claims (coping behaviors, hierarchical adaptability, reward distributions) explicitly tied to simulation outcomes rather than any derivation, prediction, or first-principles argument. No equations, fitted parameters presented as predictions, or self-citation load-bearing steps appear in the provided text. The analysis is therefore self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on standard domain assumptions about DNN vulnerability to adversarial attacks in RL; no free parameters, new entities, or ad-hoc axioms are mentioned.

axioms (1)

domain assumption: Deep neural networks forming the core of RL agents are vulnerable to adversarial attacks.
Stated as background motivation in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: the MLAH agent that utilizes a meta-learning framework to learn policies robustly online... Master policy that detects whether an adversary is present through the advantages of sub-policies

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: hierarchical coping capability, based on the adaptability of the Master policy and sub-policies

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 11 internal anchors

[2] Behzadan, V. and Munir, A. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pp. 262-275. Springer, 2017a.

[3] Behzadan, V. and Munir, A. Whatever does not kill deep reinforcement learning, makes it stronger. arXiv preprint arXiv:1712.09344, 2017b.

[4] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[5] Deng, Y., Bao, F., Kong, Y., Ren, Z., and Dai, Q. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3):653-664, 2017.

[6] Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A., and Levine, S. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.

[7] Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017.

[8] Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572

[9] Havens, A., Jiang, Z., and Sarkar, S. Online robust policy learning in the presence of unknown adversaries. In Advances in Neural Information Processing Systems, pp. 9916-9926, 2018.

[10] Huang, S., Papernot, N., Goodfellow, I., Duan, Y., and Abbeel, P. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.

[11] Jaccard, N., Rogers, T. W., Morton, E. J., and Griffin, L. D. Automated detection of smuggled high-risk security threats using deep learning. 2016.

[12] Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., and Faisal, A. A. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716, 2018.

[13] Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016a.

[14] Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016b.

[15] Lee, X. Y., Balu, A., Stoecklein, D., Ganapathysubramanian, B., and Sarkar, S. Flow shape design for microfluidic devices using deep reinforcement learning. arXiv preprint arXiv:1811.12444, 2018.

[16] Lin, Y.-C., Hong, Z.-W., Liao, Y.-H., Shih, M.-L., Liu, M.-Y., and Sun, M. Tactics of adversarial attack on deep reinforcement learning agents. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 3756-3762. AAAI Press, 2017a.

[17] Lin, Y.-C., Liu, M.-Y., Sun, M., and Huang, J.-B. Detecting adversarial attacks on neural network policies with visual foresight. arXiv preprint arXiv:1710.00814, 2017b.

[18] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[19] Neftci, E. O. and Averbeck, B. B. Reinforcement learning in artificial and biological systems. Environment, pp. 3, 2002.

[20] Peng, Y.-S., Tang, K.-F., Lin, H.-T., and Chang, E. Refuel: Exploring sparse features in deep reinforcement learning for fast disease diagnosis. In Advances in Neural Information Processing Systems, pp. 7322-7331, 2018.

[21] Rausch, V., Hansen, A., Solowjow, E., Liu, C., Kreuzer, E., and Hedrick, J. K. Learning a deep neural net policy for end-to-end control of autonomous vehicles. In 2017 American Control Conference (ACC), pp. 4914-4919. IEEE, 2017.

[22] Tretschk, E., Oh, S. J., and Fritz, M. Sequential attacks on agents for long-term adversarial goals. In 2. ACM Computer Science in Cars Symposium, 2018.

[23] Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
