Unsolved Problems in ML Safety
Recognition: 1 theorem link · Lean Theorem
Pith reviewed 2026-05-16 20:42 UTC · model grok-4.3
The pith
Machine learning safety should focus on four research areas as models scale and are deployed in critical settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present four problems ready for research, namely withstanding hazards (Robustness), identifying hazards (Monitoring), reducing inherent model hazards (Alignment), and reducing systemic hazards (Systemic Safety). Throughout, we clarify each problem's motivation and provide concrete research directions.
What carries the argument
A four-category framework dividing ML safety into Robustness, Monitoring, Alignment, and Systemic Safety.
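As a purely illustrative rendering of this framework (not code from the paper), the sketch below encodes the four categories as a small Python structure, with example hazards drawn from this review; the categorize helper and its groupings are assumptions added for illustration.

```python
# Minimal sketch of the paper's four-category ML safety taxonomy.
# Category names come from the paper; the example directions are taken
# from this review, and the lookup helper is purely illustrative.
ML_SAFETY_TAXONOMY = {
    "Robustness": [            # withstanding hazards
        "adversarial examples",
        "distribution shift",
    ],
    "Monitoring": [            # identifying hazards
        "uncertainty estimation",
        "anomaly detection",
    ],
    "Alignment": [             # reducing inherent model hazards
        "matching model objectives to human intent",
    ],
    "Systemic Safety": [       # reducing systemic hazards
        "deployment risks",
        "interactions with other systems",
    ],
}

def categorize(keyword: str) -> list[str]:
    """Return each category whose example directions mention `keyword`."""
    return [
        category
        for category, directions in ML_SAFETY_TAXONOMY.items()
        if any(keyword.lower() in d for d in directions)
    ]

print(categorize("anomaly"))  # ['Monitoring']
```

One way such a structure could be used is the classification exercise suggested below under "Where Pith is reading between the lines": tagging ongoing projects by category to spot under-explored areas.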
If this is right
- Research can target concrete directions for withstanding hazards such as adversarial examples and distribution shifts.
- Methods can be developed to identify hazards through uncertainty estimation and anomaly detection (see the sketch after this list).
- Work on alignment can reduce unintended behaviors by better matching model objectives to human intent.
- Systemic safety efforts can address risks arising from widespread deployment and interactions with other systems.
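To make the monitoring bullet concrete, here is a minimal sketch of maximum-softmax-probability (MSP) anomaly detection, the baseline of Hendrycks and Gimpel (ICLR 2017) on which much of this research area builds; the logits and threshold below are invented for illustration, and a real deployment would tune the threshold on held-out data.

```python
import numpy as np

def msp_scores(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability per example; higher suggests in-distribution.

    `logits` has shape (n_examples, n_classes).
    """
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def flag_anomalies(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Flag inputs whose top-class confidence falls below a tuned threshold."""
    return msp_scores(logits) < threshold

# Illustrative use: a peaked (confident) example vs. a flat (uncertain) one.
logits = np.array([
    [6.0, 0.1, 0.2],   # peaked -> high MSP, not flagged
    [0.9, 1.0, 1.1],   # flat   -> low MSP, flagged
])
print(flag_anomalies(logits))  # [False  True]
```

MSP is deliberately simple; the paper's Monitoring directions go further (calibrated uncertainty, detecting emergent behaviors), but a confidence-threshold detector of this shape is the usual starting baseline.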
Where Pith is reading between the lines
- This structure could help researchers classify ongoing projects and spot under-explored areas within the four categories.
- It might support cross-disciplinary efforts by linking technical fixes to broader societal deployment concerns.
- The framework could be revisited as new model capabilities appear to check whether the categories still separate cleanly.
Load-bearing premise
That these four categories cover the main safety challenges in ML without significant gaps or overlaps that would call for a different structure.
What would settle it
Discovery of a major safety issue in deployed large models that fits none of the four categories, or evidence that re-organizing the problems would accelerate progress more effectively.
read the original abstract
Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), reducing inherent model hazards ("Alignment"), and reducing systemic hazards ("Systemic Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that ML systems are rapidly scaling in size and capabilities while being deployed in high-stakes settings, making safety a leading priority. It refines the technical problems in the field into four research-ready categories—withstanding hazards (Robustness), identifying hazards (Monitoring), reducing inherent model hazards (Alignment), and reducing systemic hazards (Systemic Safety)—and supplies motivations drawn from observed large-model behaviors along with concrete research directions for each.
Significance. If the taxonomy holds as a useful organizing lens, the paper provides a coherent roadmap that could help prioritize and structure ML safety research around scaling trends and documented failure modes. Its conceptual clarity and focus on actionable directions represent a strength for guiding community efforts, though the framework's durability will depend on subsequent research outputs validating or refining the partition.
minor comments (2)
- [Abstract] The abstract and introduction could briefly note potential boundary cases between categories (e.g., whether certain adversarial robustness issues fall under Robustness or Alignment) to preempt reader questions about overlaps, even if the paper does not claim exhaustiveness.
- [Systemic Safety] Some research directions listed under Systemic Safety would benefit from one or two additional citations to contemporaneous work on deployment risks to strengthen the motivation section.
Simulated Author's Rebuttal
We thank the referee for their positive summary, assessment of significance, and recommendation to accept the manuscript. We appreciate the recognition that the four-problem taxonomy offers a coherent and actionable roadmap for ML safety research.
Circularity Check
No significant circularity in proposed taxonomy
full rationale
The paper is a high-level roadmap that organizes ML safety into four categories (Robustness, Monitoring, Alignment, Systemic Safety) motivated by scaling trends and deployment contexts. No equations, derivations, fitted parameters, or predictions appear anywhere in the manuscript. The central claim is an organizing lens rather than a technical result that could reduce to its own inputs by construction. No self-citations function as load-bearing uniqueness theorems, and no ansatzes or renamings of known results are smuggled in. The structure is presented as a useful research agenda, not a provably minimal or derived partition, making the paper self-contained with zero circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Machine learning systems are rapidly increasing in size, acquiring new capabilities, and being deployed in high-stakes settings.
- domain assumption: Safety for ML should be a leading research priority.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean; IndisputableMonolith/Cost/FunctionalEquation.lean
  Theorems: reality_from_one_distinction; washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Linked passage: "We present four problems ready for research, namely withstanding hazards (Robustness), identifying hazards (Monitoring), reducing inherent model hazards (Alignment), and reducing systemic hazards (Systemic Safety)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
- Discovering Latent Knowledge in Language Models Without Supervision
  An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
  GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
- Benchmarking Sensor-Fault Robustness in Forecasting
  SensorFault-Bench is a new CPS-grounded benchmark showing that clean-MSE rankings of forecasting models often disagree with their robustness under standardized sensor-fault scenarios across four real datasets.
- Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
  Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.
- Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
  TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
- Red Teaming Language Models with Language Models
  One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.
- Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
  A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
- SARC: A Governance-by-Architecture Framework for Agentic AI Systems
  SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-cod...
- EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
  EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.
- AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction
  AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models
  Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
- Emergent Abilities of Large Language Models
  Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
  U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
- Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents
  A neurocognitive governance model formalizes a Pre-Action Governance Reasoning Loop that consults global, workflow, agent, and situational rules before each action, yielding 95% compliance accuracy with zero false esc...
- Face Density as a Proxy for Data Complexity: Quantifying the Hardness of Instance Count
  Higher face density causes monotonic performance degradation in models and acts as a domain shift, even under balanced sampling.
- Beyond Context: Large Language Models' Failure to Grasp Users' Intent
  LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.