arxiv: 2408.12935 · v4 · pith:OGXEOEAEnew · submitted 2024-08-23 · 💻 cs.AI

AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions

Pith reviewed 2026-05-23 21:51 UTC · model grok-4.3

classification 💻 cs.AI
keywords: AI safety, large language models, trustworthy AI, responsible AI, safe AI, taxonomy, generative AI, risk mitigation

The pith

A three-perspective framework divides AI safety into trustworthy, responsible, and safe AI to analyze risks in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper advances a structured way to map the AI safety field by separating it into three perspectives: Trustworthy AI, Responsible AI, and Safe AI. This division serves as the basis for reviewing existing work, naming challenges, and outlining mitigation methods, with concrete illustrations drawn from large language models. A sympathetic reader would care because the rapid spread of generative AI has expanded safety concerns to include public and national security impacts, and a clearer map could help coordinate responses across design, testing, and deployment. If the structure holds, it supplies a consistent lens for spotting gaps and developing targeted techniques. The authors close by tying the framework to the broader goal of increasing trust in AI-driven systems.

Core claim

The authors propose a novel architectural framework for understanding and analyzing AI Safety by defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. They conduct an extensive review of current research and advancements from these perspectives, highlight key challenges and mitigation approaches, and illustrate innovative mechanisms, methodologies, and techniques for designing and testing AI safety through state-of-the-art examples from large language models, with the aim of promoting further research and enhancing trust in digital transformation.

What carries the argument

The three-perspective architectural framework that partitions AI safety into Trustworthy AI, Responsible AI, and Safe AI.

If this is right

  • Safety research on LLMs can be organized by assigning each challenge and mitigation to one of the three perspectives.
  • Reviews of state-of-the-art techniques gain consistency when grouped under Trustworthy AI, Responsible AI, or Safe AI.
  • Development of new testing and design methods can target specific perspectives to address identified gaps.
  • Policy and deployment decisions can reference the framework to balance the three aspects when scaling LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could serve as a template for classifying safety issues in non-language AI systems such as vision or robotics models.
  • Empirical validation might involve mapping a fixed set of documented LLM incidents onto the three categories to check for coverage.
  • The framework implies that future work should produce separate roadmaps or benchmarks for each perspective rather than a single undifferentiated list.
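
The coverage check suggested in the list above can be sketched in a few lines. This is a minimal sketch, assuming each incident has already been hand-labelled with zero or more perspectives. The incident names, their labels, and the `coverage_report` helper are hypothetical illustrations, not data or methods from the paper.

```python
PERSPECTIVES = ("Trustworthy AI", "Responsible AI", "Safe AI")

# Hypothetical, hand-assigned labels for illustration only (not data from the paper).
INCIDENTS = {
    "hallucinated citation in a legal brief": {"Trustworthy AI"},
    "stereotyped output in a screening tool": {"Responsible AI"},
    "jailbreak that elicits harmful instructions": {"Trustworthy AI", "Safe AI"},
    "failure mode nobody could place": set(),
}


def coverage_report(assignments):
    """Summarise how well the three perspectives cover a set of incidents."""
    for name, labels in assignments.items():
        unknown = labels - set(PERSPECTIVES)
        if unknown:
            raise ValueError(f"{name!r} uses unknown perspectives: {sorted(unknown)}")
    # Incidents with no label count against comprehensiveness.
    uncovered = sorted(n for n, labels in assignments.items() if not labels)
    # Incidents with several labels count against non-redundancy.
    overlapping = sorted(n for n, labels in assignments.items() if len(labels) > 1)
    covered = len(assignments) - len(uncovered)
    return {
        "coverage": covered / len(assignments) if assignments else 0.0,
        "uncovered": uncovered,
        "overlapping": overlapping,
    }


report = coverage_report(INCIDENTS)
```

Reporting uncovered and multiply-labelled incidents separately keeps the two properties the framework needs, comprehensiveness and non-redundancy, visible as distinct failure counts rather than a single score.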

Load-bearing premise

The assumption that partitioning AI safety into the three perspectives of Trustworthy AI, Responsible AI, and Safe AI provides a comprehensive and non-redundant structure that usefully organizes the entire field and its challenges for LLMs.

What would settle it

Discovery of a major LLM safety issue, such as an emergent failure mode in generation or alignment, that cannot be assigned to any one of the three perspectives without creating substantial overlap or leaving it outside the framework.

Figures

Figures reproduced from arXiv: 2408.12935 by Chen Chen, Kwok-Yan Lam, Si Qi Goh, Weifeng Jiang, Xueluan Gong, Ziyao Liu.

Figure 1: Conceptual relationships and dependencies among trustworthy AI, responsible AI, safe AI, and AI safety. Note that such …
Figure 2: Relations between AI foundation model and AI systems.
Figure 3: Various attacks on Multi-modal LLMs. (a) Structure-based attack, (b) Perturbation-based attack, (c) Poisoning-based attack …
Figure 4: Attacks on text watermarks. (a) Removal attacks. The detector fails to recognize text as LLM-generated after watermark …
Figure 5: Misuse cases of LLM systems and associated risks to data supply chains.
Figure 6: The progression of AI capabilities.
Figure 7: Automatic red-teaming methods using LLMs. They include the strategies of obtaining harmful prompts by: (a) Training a red …
Figure 8: Instruction tuning strategies to enhance LLM safety. (a) Standard instruction tuning. (b) MART is an iterative approach where …
Figure 9: Examples of various defensive prompt strategies.
Figure 10: An overview of guardrail systems.
  Adjacent text from §6.4.1 (Input Module): Input modules typically follow a detect-then-drop methodology, where user queries identified as malicious are directly rejected. This approach ensures that harmful or inappropriate inputs are filtered out at the earliest possible stage, thereby reducing the computational burden on the protected LLMs. Early detection research primarily employs keyword m…
Figure 11: Stakeholders within AI governance framework.
Abstract

AI Safety is an emerging area of critical importance to the safe adoption and deployment of AI systems. With the rapid proliferation of AI and especially with the recent advancement of Generative AI (or GAI), the technology ecosystem behind the design, development, adoption, and deployment of AI systems has drastically changed, broadening the scope of AI Safety to address impacts on public safety and national security. In this paper, we propose a novel architectural framework for understanding and analyzing AI Safety, defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. We provide an extensive review of current research and advancements in AI safety from these perspectives, highlighting their key challenges and mitigation approaches. Through examples from state-of-the-art technologies, particularly Large Language Models (LLMs), we present innovative mechanisms, methodologies, and techniques for designing and testing AI safety. Our goal is to promote advancement in AI safety research, and ultimately enhance people's trust in digital transformation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a novel architectural framework for AI Safety, structured around three perspectives—Trustworthy AI, Responsible AI, and Safe AI—and delivers an extensive literature review of state-of-the-art research, key challenges, and mitigation strategies, with particular emphasis on examples from LLMs.

Significance. A well-bounded taxonomy could help organize the fragmented AI safety literature for LLMs and surface gaps for future work; the paper's value therefore hinges on whether the three perspectives are shown to be both comprehensive and non-overlapping.

major comments (1)
  1. [Abstract, §1] Abstract and §1 (Introduction): the three perspectives are presented as distinct without operational definitions, assignment criteria, or exclusion rules (e.g., where alignment, robustness, or bias mitigation belong). This directly affects the central claim that the framework supplies a “comprehensive and non-redundant structure.”

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the framework's clarity.

Point-by-point responses
  1. Referee: [Abstract, §1] Abstract and §1 (Introduction): the three perspectives are presented as distinct without operational definitions, assignment criteria, or exclusion rules (e.g., where alignment, robustness, or bias mitigation belong). This directly affects the central claim that the framework supplies a “comprehensive and non-redundant structure.”

    Authors: We acknowledge that the current presentation in the abstract and §1 relies on descriptive categorization and literature examples rather than formal operational definitions or explicit assignment/exclusion rules. To directly support the claim of a comprehensive and non-redundant structure, we will revise §1 to add: (1) concise operational definitions for each perspective (Trustworthy AI: emphasis on technical reliability and verifiability; Responsible AI: emphasis on ethical, societal, and governance aspects; Safe AI: emphasis on preventing harm and ensuring alignment with human intent); (2) assignment criteria with examples (e.g., robustness and explainability assigned to Trustworthy AI, bias/fairness to Responsible AI, and alignment/harm prevention to Safe AI); and (3) exclusion rules noting that while minor overlaps exist, primary objectives determine placement. A new summary table will map key topics such as alignment, robustness, and bias mitigation to perspectives. revision: yes
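
The assignment criteria described in the response above can be written down as a small lookup. This is a minimal sketch, assuming the primary-objective rule is applied as a one-to-one mapping. The topic-to-perspective pairs are the examples given in the rebuttal, while the names `PRIMARY_PERSPECTIVE`, `assign`, and `summary_table` are hypothetical and do not come from the paper.

```python
# Topic -> perspective pairs as listed in the rebuttal's examples.
PRIMARY_PERSPECTIVE = {
    "robustness": "Trustworthy AI",
    "explainability": "Trustworthy AI",
    "bias mitigation": "Responsible AI",
    "fairness": "Responsible AI",
    "alignment": "Safe AI",
    "harm prevention": "Safe AI",
}


def assign(topic):
    """Return the single perspective chosen by a topic's primary objective."""
    key = topic.strip().lower()
    if key not in PRIMARY_PERSPECTIVE:
        # An unmapped topic is a gap in the taxonomy, so surface it loudly.
        raise KeyError(f"no perspective assigned for topic {topic!r}")
    return PRIMARY_PERSPECTIVE[key]


def summary_table(topics):
    """Render a plain-text topic/perspective table for the given topics."""
    rows = [(t, assign(t)) for t in topics]
    width = max(len("Topic"), *(len(t) for t, _ in rows))
    lines = [f"{'Topic'.ljust(width)}  Perspective"]
    lines += [f"{t.ljust(width)}  {p}" for t, p in rows]
    return "\n".join(lines)


table = summary_table(["alignment", "robustness", "bias mitigation"])
```

Raising on an unmapped topic, rather than returning a default, mirrors the referee's concern: a topic with no clear home is evidence about the framework, not a case to paper over.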

Circularity Check

0 steps flagged

No circularity: taxonomy proposal with no derivations or self-referential reductions

The paper is a literature review and taxonomy proposal. It introduces a three-perspective framework (Trustworthy AI, Responsible AI, Safe AI) as its central contribution but contains no equations, fitted parameters, predictions, or derivation chains. No load-bearing steps reduce by construction to inputs, self-citations, or prior author work. The framework is presented as an organizational ansatz without claiming to derive results from itself or external benchmarks in a circular way. This matches the default expectation for non-circular review papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard background concepts from AI ethics and safety literature without introducing new free parameters, axioms, or invented entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Limitations on Accurate, Trusted, Human-level Reasoning

    cs.LG · 2025-09 · unverdicted · novelty 6.0

    An accurate and trusted AI system cannot achieve human-level reasoning because there exist tasks easily solvable by humans but not by the system.
