Unsolved Problems in ML Safety
Recognition: 1 theorem link · Lean Theorem
Pith reviewed 2026-05-16 20:42 UTC · model grok-4.3
The pith
Machine learning safety should focus on four research areas as models scale and are deployed in critical settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present four problems ready for research, namely withstanding hazards (Robustness), identifying hazards (Monitoring), reducing inherent model hazards (Alignment), and reducing systemic hazards (Systemic Safety). Throughout, we clarify each problem's motivation and provide concrete research directions.
What carries the argument
A four-category framework dividing ML safety into Robustness, Monitoring, Alignment, and Systemic Safety.
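As a purely illustrative rendering of this framework (not code from the paper), the sketch below encodes the four categories as a small Python structure, with example hazards drawn from this review; the categorize helper and its groupings are assumptions added for illustration.

```python
# Minimal sketch of the paper's four-category ML safety taxonomy.
# Category names come from the paper; the example directions are taken
# from this review, and the lookup helper is purely illustrative.
ML_SAFETY_TAXONOMY = {
    "Robustness": [            # withstanding hazards
        "adversarial examples",
        "distribution shift",
    ],
    "Monitoring": [            # identifying hazards
        "uncertainty estimation",
        "anomaly detection",
    ],
    "Alignment": [             # reducing inherent model hazards
        "matching model objectives to human intent",
    ],
    "Systemic Safety": [       # reducing systemic hazards
        "deployment risks",
        "interactions with other systems",
    ],
}

def categorize(keyword: str) -> list[str]:
    """Return each category whose example directions mention `keyword`."""
    return [
        category
        for category, directions in ML_SAFETY_TAXONOMY.items()
        if any(keyword.lower() in d for d in directions)
    ]

print(categorize("anomaly"))  # ['Monitoring']
```

One way such a structure could be used is the classification exercise suggested below under "Where Pith is reading between the lines": tagging ongoing projects by category to spot under-explored areas.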
If this is right
- Research can target concrete directions for withstanding hazards such as adversarial examples and distribution shifts.
- Methods can be developed to identify hazards through uncertainty estimation and anomaly detection (see the sketch after this list).
- Work on alignment can reduce unintended behaviors by better matching model objectives to human intent.
- Systemic safety efforts can address risks arising from widespread deployment and interactions with other systems.
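To make the monitoring bullet concrete, here is a minimal sketch of maximum-softmax-probability (MSP) anomaly detection, the baseline of Hendrycks and Gimpel (ICLR 2017) on which much of this research area builds; the logits and threshold below are invented for illustration, and a real deployment would tune the threshold on held-out data.

```python
import numpy as np

def msp_scores(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability per example; higher suggests in-distribution.

    `logits` has shape (n_examples, n_classes).
    """
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def flag_anomalies(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Flag inputs whose top-class confidence falls below a tuned threshold."""
    return msp_scores(logits) < threshold

# Illustrative use: a peaked (confident) example vs. a flat (uncertain) one.
logits = np.array([
    [6.0, 0.1, 0.2],   # peaked -> high MSP, not flagged
    [0.9, 1.0, 1.1],   # flat   -> low MSP, flagged
])
print(flag_anomalies(logits))  # [False  True]
```

MSP is deliberately simple; the paper's Monitoring directions go further (calibrated uncertainty, detecting emergent behaviors), but a confidence-threshold detector of this shape is the usual starting baseline.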
Where Pith is reading between the lines
- This structure could help researchers classify ongoing projects and spot under-explored areas within the four categories.
- It might support cross-disciplinary efforts by linking technical fixes to broader societal deployment concerns.
- The framework could be revisited as new model capabilities appear to check whether the categories still separate cleanly.
Load-bearing premise
That these four categories cover the main safety challenges in ML without significant gaps or overlaps that would call for a different structure.
What would settle it
Discovery of a major safety issue in deployed large models that fits none of the four categories, or evidence that re-organizing the problems would accelerate progress more effectively.
read the original abstract
Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), reducing inherent model hazards ("Alignment"), and reducing systemic hazards ("Systemic Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that ML systems are rapidly scaling in size and capabilities while being deployed in high-stakes settings, making safety a leading priority. It refines the technical problems in the field into four research-ready categories—withstanding hazards (Robustness), identifying hazards (Monitoring), reducing inherent model hazards (Alignment), and reducing systemic hazards (Systemic Safety)—and supplies motivations drawn from observed large-model behaviors along with concrete research directions for each.
Significance. If the taxonomy holds as a useful organizing lens, the paper provides a coherent roadmap that could help prioritize and structure ML safety research around scaling trends and documented failure modes. Its conceptual clarity and focus on actionable directions represent a strength for guiding community efforts, though the framework's durability will depend on subsequent research outputs validating or refining the partition.
minor comments (2)
- [Abstract] The abstract and introduction could briefly note potential boundary cases between categories (e.g., whether certain adversarial robustness issues fall under Robustness or Alignment) to preempt reader questions about overlaps, even if the paper does not claim exhaustiveness.
- [Systemic Safety] Some research directions listed under Systemic Safety would benefit from one or two additional citations to contemporaneous work on deployment risks to strengthen the motivation section.
Simulated Author's Rebuttal
We thank the referee for their positive summary, assessment of significance, and recommendation to accept the manuscript. We appreciate the recognition that the four-problem taxonomy offers a coherent and actionable roadmap for ML safety research.
Circularity Check
No significant circularity in proposed taxonomy
full rationale
The paper is a high-level roadmap that organizes ML safety into four categories (Robustness, Monitoring, Alignment, Systemic Safety) motivated by scaling trends and deployment contexts. No equations, derivations, fitted parameters, or predictions appear anywhere in the manuscript. The central claim is an organizing lens rather than a technical result that could reduce to its own inputs by construction. No self-citations function as load-bearing uniqueness theorems, and no ansatzes or renamings of known results are smuggled in. The structure is presented as a useful research agenda, not a provably minimal or derived partition, making the paper self-contained with zero circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Machine learning systems are rapidly increasing in size, acquiring new capabilities, and being deployed in high-stakes settings.
- domain assumption: Safety for ML should be a leading research priority.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean; IndisputableMonolith/Cost/FunctionalEquation.lean
  Theorems: reality_from_one_distinction; washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Linked passage: "We present four problems ready for research, namely withstanding hazards (Robustness), identifying hazards (Monitoring), reducing inherent model hazards (Alignment), and reducing systemic hazards (Systemic Safety)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
- Discovering Latent Knowledge in Language Models Without Supervision
  An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
  GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
- Benchmarking Sensor-Fault Robustness in Forecasting
  SensorFault-Bench is a new CPS-grounded benchmark showing that clean-MSE rankings of forecasting models often disagree with their robustness under standardized sensor-fault scenarios across four real datasets.
- Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
  Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.
- Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
  TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
- Red Teaming Language Models with Language Models
  One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.
- Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
  A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
- SARC: A Governance-by-Architecture Framework for Agentic AI Systems
  SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-cod...
- EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
  EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.
- AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction
  AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models
  Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
- Emergent Abilities of Large Language Models
  Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
  U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
- Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents
  A neurocognitive governance model formalizes a Pre-Action Governance Reasoning Loop that consults global, workflow, agent, and situational rules before each action, yielding 95% compliance accuracy with zero false esc...
- Face Density as a Proxy for Data Complexity: Quantifying the Hardness of Instance Count
  Higher face density causes monotonic performance degradation in models and acts as a domain shift, even under balanced sampling.
- Beyond Context: Large Language Models' Failure to Grasp Users' Intent
  LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.