AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions

Chen Chen; Kwok-Yan Lam; Si Qi Goh; Weifeng Jiang; Xueluan Gong; Ziyao Liu

arxiv: 2408.12935 · v4 · pith:OGXEOEAEnew · submitted 2024-08-23 · 💻 cs.AI

Chen Chen, Xueluan Gong, Ziyao Liu, Weifeng Jiang, Si Qi Goh, Kwok-Yan Lam

Pith reviewed 2026-05-23 21:51 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI safety · large language models · trustworthy AI · responsible AI · safe AI · taxonomy · generative AI · risk mitigation

The pith

A three-perspective framework divides AI safety into trustworthy, responsible, and safe AI to analyze risks in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper advances a structured way to map the AI safety field by separating it into three perspectives: Trustworthy AI, Responsible AI, and Safe AI. This division serves as the basis for reviewing existing work, naming challenges, and outlining mitigation methods, with concrete illustrations drawn from large language models. A sympathetic reader would care because the rapid spread of generative AI has expanded safety concerns to include public and national security impacts, and a clearer map could help coordinate responses across design, testing, and deployment. If the structure holds, it supplies a consistent lens for spotting gaps and developing targeted techniques. The authors close by tying the framework to the broader goal of increasing trust in AI-driven systems.

Core claim

The authors propose a novel architectural framework for understanding and analyzing AI Safety by defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. They conduct an extensive review of current research and advancements from these perspectives, highlight key challenges and mitigation approaches, and illustrate innovative mechanisms, methodologies, and techniques for designing and testing AI safety through state-of-the-art examples from large language models, with the aim of promoting further research and enhancing trust in digital transformation.

What carries the argument

The three-perspective architectural framework that partitions AI safety into Trustworthy AI, Responsible AI, and Safe AI.

If this is right

Safety research on LLMs can be organized by assigning each challenge and mitigation to one of the three perspectives.
Reviews of state-of-the-art techniques gain consistency when grouped under Trustworthy AI, Responsible AI, or Safe AI.
Development of new testing and design methods can target specific perspectives to address identified gaps.
Policy and deployment decisions can reference the framework to balance the three aspects when scaling LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy could serve as a template for classifying safety issues in non-language AI systems such as vision or robotics models.
Empirical validation might involve mapping a fixed set of documented LLM incidents onto the three categories to check for coverage.
The framework implies that future work should produce separate roadmaps or benchmarks for each perspective rather than a single undifferentiated list.
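
The coverage check suggested above could be sketched roughly as follows. This is a minimal illustration, not anything the paper provides: the function name `audit_taxonomy`, the incident names, and their labels are all hypothetical, and the labels stand in for judgments a human annotator would have to make.

```python
from collections import Counter

# The three perspectives named in the paper's framework.
PERSPECTIVES = ("Trustworthy AI", "Responsible AI", "Safe AI")


def audit_taxonomy(assignments):
    """Summarize how labelled incidents fall onto the three perspectives.

    `assignments` maps an incident name to the set of perspectives an
    annotator judged applicable; an empty set means "fits none of them".
    """
    uncovered, overlapping = [], []
    counts = Counter({p: 0 for p in PERSPECTIVES})
    for incident, tags in assignments.items():
        unknown = set(tags) - set(PERSPECTIVES)
        if unknown:
            raise ValueError(f"{incident!r} uses unknown labels: {sorted(unknown)}")
        if not tags:
            uncovered.append(incident)    # gap: the framework misses this case
        elif len(tags) > 1:
            overlapping.append(incident)  # redundancy: several perspectives claim it
        counts.update(tags)
    return {
        "uncovered": sorted(uncovered),
        "overlapping": sorted(overlapping),
        "counts": dict(counts),
    }


# Hypothetical incidents and labels, invented purely for illustration.
incidents = {
    "indirect prompt injection leaks user data": {"Trustworthy AI"},
    "stereotyped output in a hiring summary": {"Responsible AI"},
    "jailbreak elicits harmful instructions": {"Trustworthy AI", "Safe AI"},
    "unclassified emergent failure": set(),
}
report = audit_taxonomy(incidents)
print(report)
```

A long `uncovered` list would count against the framework being comprehensive, and a long `overlapping` list against it being non-redundant. Either result depends entirely on how the incidents are labelled.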

Load-bearing premise

The assumption that partitioning AI safety into the three perspectives of Trustworthy AI, Responsible AI, and Safe AI provides a comprehensive and non-redundant structure that usefully organizes the entire field and its challenges for LLMs.

What would settle it

Discovery of a major LLM safety issue, such as an emergent failure mode in generation or alignment, that cannot be assigned to any one of the three perspectives without creating substantial overlap or leaving it outside the framework.

Figures

Figures reproduced from arXiv: 2408.12935 by Chen Chen, Kwok-Yan Lam, Si Qi Goh, Weifeng Jiang, Xueluan Gong, Ziyao Liu.

**Figure 1.** Conceptual relationships and dependencies among trustworthy AI, responsible AI, safe AI, and AI safety. Note that such …

**Figure 2.** Relations between AI foundation model and AI systems.

**Figure 3.** Various attacks on Multi-modal LLMs. (a) Structure-based attack, (b) Perturbation-based attack, (c) Poisoning-based attack …

**Figure 4.** Attacks on text watermarks. (a) Removal attacks. The detector fails to recognize text as LLM-generated after watermark …

**Figure 5.** Misuse cases of LLM systems and associated risks to data supply chains.

**Figure 6.** The progression of AI capabilities.

**Figure 7.** Automatic red-teaming methods using LLMs. They include the strategies of obtaining harmful prompts by: (a) Training a red …

**Figure 8.** Instruction tuning strategies to enhance LLM safety. (a) Standard instruction tuning. (b) MART is an iterative approach where …

**Figure 9.** Examples of various defensive prompt strategies.

**Figure 10.** An overview of guardrail systems.

6.4.1 Input Module. Input modules typically follow a detect-then-drop methodology, where user queries identified as malicious are directly rejected. This approach ensures that harmful or inappropriate inputs are filtered out at the earliest possible stage, thereby reducing the computational burden on the protected LLMs. Early detection research primarily employs keyword m…

**Figure 11.** Stakeholders within AI governance framework.

Original abstract

AI Safety is an emerging area of critical importance to the safe adoption and deployment of AI systems. With the rapid proliferation of AI and especially with the recent advancement of Generative AI (or GAI), the technology ecosystem behind the design, development, adoption, and deployment of AI systems has drastically changed, broadening the scope of AI Safety to address impacts on public safety and national security. In this paper, we propose a novel architectural framework for understanding and analyzing AI Safety, defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. We provide an extensive review of current research and advancements in AI safety from these perspectives, highlighting their key challenges and mitigation approaches. Through examples from state-of-the-art technologies, particularly Large Language Models (LLMs), we present innovative mechanisms, methodologies, and techniques for designing and testing AI safety. Our goal is to promote advancement in AI safety research, and ultimately enhance people's trust in digital transformation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A survey reorganizing AI safety into three categories without clear boundaries or new technical results.

Hi,

The main thing to know is that this is a survey paper that proposes splitting AI safety for LLMs into three perspectives (Trustworthy AI, Responsible AI, and Safe AI) and then reviews existing work through those lenses. It introduces no new mechanisms, measurements, or derivations.

What it does reasonably well is pull together challenges and mitigation approaches across the literature, using current LLM examples to illustrate points on robustness, alignment, ethics, and related topics. That synthesis could help someone map the landscape quickly.

The soft spot is the taxonomy. The abstract treats the three perspectives as distinct and useful for organizing the field, but it gives no operational definitions, decision criteria, or exclusion rules, so overlap looks possible and the practical advantage over earlier reviews is not obvious. As a literature review rather than a technical paper, there are also no equations, code, or data to check for reproducibility.

The paper is aimed at readers who need an overview of AI safety issues tied to LLMs rather than specialists looking for original results. A newcomer to the area might find it a decent starting point, while someone already working in one of the subfields will likely see mostly familiar material.

I would send it for peer review if the full text shows thorough coverage and makes a stronger case for why the three-way split reduces redundancy; surveys can still be worth referee effort when they organize the literature cleanly.

Referee Report

1 major / 0 minor

Summary. The paper proposes a novel architectural framework for AI Safety, structured around three perspectives—Trustworthy AI, Responsible AI, and Safe AI—and delivers an extensive literature review of state-of-the-art research, key challenges, and mitigation strategies, with particular emphasis on examples from LLMs.

Significance. A well-bounded taxonomy could help organize the fragmented AI safety literature for LLMs and surface gaps for future work; the paper's value therefore hinges on whether the three perspectives are shown to be both comprehensive and non-overlapping.

major comments (1)

[Abstract, §1] Abstract and §1 (Introduction): the three perspectives are presented as distinct without operational definitions, assignment criteria, or exclusion rules (e.g., where alignment, robustness, or bias mitigation belong). This directly affects the central claim that the framework supplies a “comprehensive and non-redundant structure.”

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the framework's clarity.

Referee: [Abstract, §1] Abstract and §1 (Introduction): the three perspectives are presented as distinct without operational definitions, assignment criteria, or exclusion rules (e.g., where alignment, robustness, or bias mitigation belong). This directly affects the central claim that the framework supplies a “comprehensive and non-redundant structure.”

Authors: We acknowledge that the current presentation in the abstract and §1 relies on descriptive categorization and literature examples rather than formal operational definitions or explicit assignment/exclusion rules. To directly support the claim of a comprehensive and non-redundant structure, we will revise §1 to add: (1) concise operational definitions for each perspective (Trustworthy AI: emphasis on technical reliability and verifiability; Responsible AI: emphasis on ethical, societal, and governance aspects; Safe AI: emphasis on preventing harm and ensuring alignment with human intent); (2) assignment criteria with examples (e.g., robustness and explainability assigned to Trustworthy AI, bias/fairness to Responsible AI, and alignment/harm prevention to Safe AI); and (3) exclusion rules noting that while minor overlaps exist, primary objectives determine placement. A new summary table will map key topics such as alignment, robustness, and bias mitigation to perspectives.

revision: yes

Circularity Check

0 steps flagged

No circularity: taxonomy proposal with no derivations or self-referential reductions

The paper is a literature review and taxonomy proposal. It introduces a three-perspective framework (Trustworthy AI, Responsible AI, Safe AI) as its central contribution but contains no equations, fitted parameters, predictions, or derivation chains. No load-bearing steps reduce by construction to inputs, self-citations, or prior author work. The framework is presented as an organizational ansatz without claiming to derive results from itself or external benchmarks in a circular way. This matches the default expectation for non-circular review papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard background concepts from AI ethics and safety literature without introducing new free parameters, axioms, or invented entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Limitations on Accurate, Trusted, Human-level Reasoning
cs.LG 2025-09 unverdicted novelty 6.0

An accurate and trusted AI system cannot achieve human-level reasoning because there exist tasks easily solvable by humans but not by the system.

Reference graph

Works this paper leans on

296 extracted references · 296 canonical work pages · cited by 1 Pith paper · 32 internal anchors

[1]

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security . 308–318. 45https://openai.com/index/introducing-superalignment/ ACM Comput. Surv. 56 C. Chen et al

work page 2016
[2]

Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004 (ACM International Conference Proceeding Series, Vol. 69) , Carla E. Brodley (Ed.). https://doi.org/10.1145/1015330.1015430

work page doi:10.1145/1015330.1015430 2004
[3]

Sahar Abdelnabi, Kai Greshake, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec 2023, Copenhagen, Denmark, 30 November 2023 , Maura ...

work page doi:10.1145/3605764.3623985 2023
[4]

Abubakar Abid, Maheen Farooqi, and James Zou. 2021. Persistent Anti-Muslim Bias in Large Language Models. In AIES ’21: AAAI/ACM Conference on AI, Ethics, and Society, Virtual Event, USA, May 19-21, 2021 , Marion Fourcade, Benjamin Kuipers, Seth Lazar, and Deirdre K. Mulligan (Eds.). 298–306. https://doi.org/10.1145/3461702.3462624

work page doi:10.1145/3461702.3462624 2021
[5]

Adams, Tyler Cody, and Peter A

Stephen C. Adams, Tyler Cody, and Peter A. Beling. 2022. A survey of inverse reinforcement learning. Artif. Intell. Rev. 55, 6 (2022), 4307–4346. https://doi.org/10.1007/S10462-021-10108-X

work page doi:10.1007/s10462-021-10108-x 2022
[6]

Muhammad Aurangzeb Ahmad, Ilker Yaramis, and Taposh Dutta Roy. 2023. Creating Trustworthy LLMs: Dealing with Hallucinations in Healthcare AI. CoRR abs/2311.01463 (2023). https://doi.org/10.48550/ARXIV.2311.01463 arXiv:2311.01463

work page doi:10.48550/arxiv.2311.01463 2023
[7]

Jaimeen Ahn and Alice Oh. 2021. Mitigating Language-Dependent Ethnic Bias in BERT. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021 , Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). 533–549. https://do...

work page doi:10.18653/v1/2021.emnlp-main.42 2021
[8]

Guardrail AI. 2023. Build AI powered applications with confidence. https://www.guardrailsai.com/

work page 2023
[9]

NIST AI. 2023. Artificial Intelligence Risk Management Framework (AI RMF 1.0). (2023)

work page 2023
[10]

Ulrich Aïvodji, Alexandre Bolot, and Sébastien Gambs. 2020. Model extraction from counterfactual explanations. arXiv preprint arXiv:2009.01884 (2020)

work page arXiv 2020
[11]

Renat Aksitov, Chung-Ching Chang, David Reitter, Siamak Shakeri, and Yun-Hsuan Sung. 2023. Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models. CoRR abs/2302.05578 (2023). https://doi.org/10.48550/ARXIV.2302.05578 arXiv:2302.05578

work page doi:10.48550/arxiv.2302.05578 2023
[12]

Hussam Alkaissi and Samy I McFarlane. 2023. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus 15, 2 (2023)

work page 2023
[13]

Hunt Allcott and Matthew Gentzkow. 2017. Social media and fake news in the 2016 election. Journal of economic perspectives 31, 2 (2017), 211–236

work page 2017
[14]

Bibb Allen, Sheela Agarwal, Jayashree Kalpathy-Cramer, and Keith Dreyer. 2019. Democratizing ai. Journal of the American College of Radiology 16, 7 (2019), 961–963

work page 2019
[15]

Firas Almukhtar, Nawzad Mahmoodd, and Shahab Kareem. 2021. Search engine optimization: a review. Applied computer science 17, 1 (2021), 70–80

work page 2021
[16]

Gabriel Alon and Michael Kamfonas. 2023. Detecting Language Model Attacks with Perplexity. CoRR abs/2308.14132 (2023). https://doi.org/10. 48550/ARXIV.2308.14132 arXiv:2308.14132

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Srivastava, and Kai-Wei Chang

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, and Kai-Wei Chang. 2018. Generating Natural Language Adversarial Examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi ...

work page doi:10.18653/v1/d18-1316 2018
[18]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. 2016. Concrete Problems in AI Safety. CoRR abs/1606.06565 (2016). arXiv:1606.06565 http://arxiv.org/abs/1606.06565

work page internal anchor Pith review Pith/arXiv arXiv 2016
[19]

Markus Anderljung, Joslyn Barnhart, Jade Leung, Anton Korinek, Cullen O’Keefe, Jess Whittlestone, Shahar Avin, Miles Brundage, Justin Bullock, Duncan Cass-Beggs, et al. 2023. Frontier AI regulation: Managing emerging risks to public safety. arXiv preprint arXiv:2307.03718 (2023)

work page arXiv 2023
[20]

Gemini: A Family of Highly Capable Multimodal Models

Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Is...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.11805 2023
[21]

PaLM 2 Technical Report

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Z...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.10403 2023
[22]

AI Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card 1 (2024)

work page 2024
[23]

Marianna Apidianaki and Aina Garí Soler. 2021. ALL Dolphins Are Intelligent and SOME Are Friendly: Probing BERT for Nouns’ Semantic Properties and their Prototypicality. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2021, Punta Cana, Dominican Republic, November 11, 2021 , Jasmij...

work page doi:10.18653/v1/2021.blackboxnlp-1.7 2021
[24]

Martín Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. 2019. Invariant Risk Minimization. CoRR abs/1907.02893 (2019). arXiv:1907.02893 http://arxiv.org/abs/1907.02893

work page internal anchor Pith review Pith/arXiv arXiv 2019
[25]

Stuart Armstrong. 2010. Utility indifference. (2010)

work page 2010
[26]

Stuart Armstrong. 2015. Motivated Value Selection for Artificial Agents. In Artificial Intelligence and Ethics, Papers from the 2015 AAAI Workshop, Austin, Texas, USA, January 25, 2015 (AAAI Technical Report, Vol. WS-15-02) , Toby Walsh (Ed.). http://aaai.org/ocs/index.php/WS/AAAIW15/paper/ view/10183

work page 2015
[27]

Stuart Armstrong, Anders Sandberg, and Nick Bostrom. 2012. Thinking Inside the Box: Controlling and Using an Oracle AI. Minds Mach. 22, 4 (2012), 299–324. https://doi.org/10.1007/S11023-012-9282-2

work page doi:10.1007/s11023-012-9282-2 2012
[28]

Anupam Arora, Rahul Telang, and Hong Xu. 2021. Do Data Breaches Damage Reputation? Evidence from 45 Cases. Journal of Cybersecurity 7, 1 (2021). https://academic.oup.com/cybersecurity/article/7/1/tyab021/6362163

work page 2021
[29]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. CoRR abs/2108.07732 (2021). arXiv:2108.07732 https: //arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[30]

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Yitzhak Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. 2023. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models. CoRR...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. CoRR abs/1607.06450 (2016). arXiv:1607.06450 http: //arxiv.org/abs/1607.06450

work page internal anchor Pith review Pith/arXiv arXiv 2016
[32]

James Babcock, János Kramár, and Roman Yampolskiy. 2016. The AGI Containment Problem. In Artificial General Intelligence - 9th International Conference, AGI 2016, New York, NY, USA, July 16-19, 2016, Proceedings (Lecture Notes in Computer Science, Vol. 9782) , Bas R. Steunebrink, Pei Wang, and Ben Goertzel (Eds.). 53–63. https://doi.org/10.1007/978-3-319-...

work page doi:10.1007/978-3-319-41649-6_6 2016
[33]

James Babcock, Janos Kramar, and Roman V Yampolskiy. 2019. Guidelines for artificial intelligence containment. (2019)

work page 2019
[34]

Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. 2023. (Ab) using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs. arXiv preprint arXiv:2307.10490 (2023)

work page arXiv 2023
[35]

Eugene Bagdasaryan and Vitaly Shmatikov. 2023. Ceci n’est pas une pomme: Adversarial Illusions in Multi-Modal Embeddings.CoRR abs/2308.11804 (2023). https://doi.org/10.48550/ARXIV.2308.11804 arXiv:2308.11804

work page doi:10.48550/arxiv.2308.11804 2023
[36]

Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, and Liang Zhao. 2024. Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models. CoRR abs/2401.00625 (2024). https://doi.org/10.48550/ARXIV.2401.00625 arXiv:2401.00625

work page doi:10.48550/arxiv.2401.00625 2024
[37]

Tao Bai, Jinqi Luo, Jun Zhao, Bihan Wen, and Qian Wang. 2021. Recent Advances in Adversarial Training for Adversarial Robustness. InProceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021 , Zhi-Hua Zhou (Ed.). 4312–4321. https://doi.org/10.24963/IJCAI.2021/591

work page doi:10.24963/ijcai.2021/591 2021
[38]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.05862 2022
[39]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.08073 2022
[40]

Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. 2023. Image Hijacks: Adversarial Images can Control Generative Models at Runtime. CoRR abs/2309.00236 (2023). https://doi.org/10.48550/ARXIV.2309.00236 arXiv:2309.00236

work page doi:10.48550/arxiv.2309.00236 2023
[41]

Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2019. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 2 (2019), 423–443. https://doi.org/10.1109/TPAMI.2018.2798607

work page doi:10.1109/tpami.2018.2798607 2019
[42]

Do, Yan Xu, and Pascale Fung

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. In Proceedings of the 13th International Joint Conference on Natural Langua...

work page 2023
[43]

Yejin Bang, Delong Chen, Nayeon Lee, and Pascale Fung. 2024. Measuring Political Bias in Large Language Models: What Is Said and How It Is Said. CoRR abs/2403.18932 (2024). https://doi.org/10.48550/ARXIV.2403.18932 arXiv:2403.18932

work page doi:10.48550/arxiv.2403.18932 2024
[44]

Hritik Bansal, Fan Yin, Nishad Singhi, Aditya Grover, Yu Yang, and Kai-Wei Chang. 2023. CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023 . 112–123. https://doi.org/10.1109/ICCV51070.2023.00017

work page doi:10.1109/iccv51070.2023.00017 2023
[45]

Soumya Barikeri, Anne Lauscher, Ivan Vulic, and Goran Glavas. 2021. RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (V...

work page doi:10.18653/v1/2021.acl-long.151 2021
[46]

Oren Barkan, Edan Hauon, Avi Caciularu, Ori Katz, Itzik Malkiel, Omri Armstrong, and Noam Koenigstein. 2022. Grad-SAM: Explaining Transformers via Gradient Self-Attention Maps. CoRR abs/2204.11073 (2022). https://doi.org/10.48550/ARXIV.2204.11073 arXiv:2204.11073

work page doi:10.48550/arxiv.2204.11073 2022
[47]

Vita Santa Barletta, Danilo Caivano, Domenico Gigante, and Azzurra Ragone. 2023. A Rapid Review of Responsible AI frameworks: How to guide the development of ethical AI. In Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering, EASE 2023, Oulu, Finland, June 14-16, 2023 . 358–367. https://doi.org/10.1145/359...

work page doi:10.1145/3593434.3593478 2023
[48]

Dipto Barman, Ziyi Guo, and Owen Conlan. 2024. The Dark Side of Language Models: Exploring the Potential of LLMs in Multimedia Disinformation Generation and Dissemination. Machine Learning with Applications (2024), 100545

work page 2024
[49]

Marion Bartl, Malvina Nissim, and Albert Gatt. 2020. Unmasking Contextual Stereotypes: Measuring and Mitigating BERT’s Gender Bias. CoRR abs/2010.14534 (2020). arXiv:2010.14534 https://arxiv.org/abs/2010.14534

work page arXiv 2020
[50]

Max Bartolo, Tristan Thrush, Sebastian Riedel, Pontus Stenetorp, Robin Jia, and Douwe Kiela. 2022. Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, W A, United...

work page doi:10.18653/v1/2022.naacl-main.275 2022
[51]

Anthony Bau, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James R. Glass. 2019. Identifying and Controlling Important Neurons in Neural Machine Translation. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,

work page 2019
[52]

https://openreview.net/forum?id=H1z-PsR5KX

work page
[53]

Seth D Baum. 2023. Assessing natural global catastrophic risks. Natural Hazards 115, 3 (2023), 2699–2719. https://doi.org/10.1007/s11069-022- 05660-w Epub 2022 Oct 12. PMID: 36245947; PMCID: PMC9553633

work page doi:10.1007/s11069-022- 2023
[54]

Tobias Baumann. 2018. Why I expect successful (narrow) alignment. https://s-risks.org/why-i-expect-successful-alignment/

work page 2018
[56]

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit Dataset. In Proceedings of the Fourteenth International AAAI Conference on Web and Social Media, ICWSM 2020, Held Virtually, Original Venue: Atlanta, Georgia, USA, June 8-11, 2020, Munmun De Choudhury, Rumi Chunara, Aron Culotta, and Brooke Fo...

work page 2020
[57]

Mika Beckerich, Laura Plein, and Sergio Coronado. 2023. RatGPT: Turning online LLMs into Proxies for Malware Attacks. CoRR abs/2308.09183 (2023). https://doi.org/10.48550/ARXIV.2308.09183 arXiv:2308.09183

work page doi:10.48550/arxiv.2308.09183 2023
[58]

Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James R. Glass. 2017. Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP 2017, Taipei, Taiwan, November 27 - December...

work page 2017
[59]

James Henry Bell, Kallista A Bonawitz, Adrià Gascón, Tancrède Lepoint, and Mariana Raykova. 2020. Secure single-server aggregation with (poly) logarithmic overhead. In ACM SIGSAC Conference on Computer and Communications Security . 1253–1269

work page 2020
[60]

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. 2023. Eliciting Latent Predictions from Transformers with the Tuned Lens. CoRR abs/2303.08112 (2023). https://doi.org/10.48550/ARXIV.2303.08112 arXiv:2303.08112

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08112 2023
[61]

Yoshua Bengio. 2023. How Rogue AIs may Arise. https://yoshuabengio.org/2023/05/22/how-rogue-ais-may-arise

work page 2023
[62]

Leonard Bereska and Efstratios Gavves. 2024. Mechanistic Interpretability for AI Safety - A Review. CoRR abs/2404.14082 (2024). https: //doi.org/10.48550/ARXIV.2404.14082 arXiv:2404.14082

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.14082 2024
[63]

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Confere...

[64]

Rishabh Bhardwaj and Soujanya Poria. 2023. Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. CoRR abs/2308.09662 (2023). https://doi.org/10.48550/ARXIV.2308.09662 arXiv:2308.09662

[65]

Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. 2023. Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. CoRR abs/2309.07875 (2023). https://doi.org/10.48550/ARXIV.2309.07875 arXiv:2309.07875

[66]

Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. 2017. Evasion Attacks against Machine Learning at Test Time. CoRR abs/1708.06131 (2017). arXiv:1708.06131 http://arxiv.org/abs/1708.06131

[67]

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

[68]

Teemu Birkstedt, Matti Minkkinen, Anushree Tandon, and Matti Mäntymäki. 2023. AI governance: themes, knowledge gaps and future agendas. Internet Research 33, 7 (2023), 133–167

[69]

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, Daniel D. Lee, M...

[70]

Nick Bostrom. 2002. Existential risks: Analyzing human extinction scenarios and related hazards. Journal of Evolution and Technology 9 (2002)

[71]

Nick Bostrom. 2014. Superintelligence: Paths, Dangers, Strategies

[72]

Djamila Bouhata and Hamouma Moumen. 2022. Byzantine Fault Tolerance in Distributed Machine Learning: a Survey. CoRR abs/2205.02572 (2022). https://doi.org/10.48550/ARXIV.2205.02572 arXiv:2205.02572

[73]

Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. 2021. Machine unlearning. In IEEE Symposium on Security and Privacy. 141–159

[74]

Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamile Lukosiute, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Ka...

[75]

Stephen W. Boyd and Angelos D. Keromytis. 2004. SQLrand: Preventing SQL Injection Attacks. In Applied Cryptography and Network Security, Second International Conference, ACNS 2004, Yellow Mountain, China, June 8-11, 2004, Proceedings (Lecture Notes in Computer Science, Vol. 3089), Markus Jakobsson, Moti Yung, and Jianying Zhou (Eds.). 292–302. https://do...

[76]

Hezekiah J. Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi. 2022. Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples. CoRR abs/2209.02128 (2022). https://doi.org/10.48550/ARXIV.2209.02128 arXiv:2209.02128

[77]

CSET Policy Brief. 2021. AI and the Future of Disinformation Campaigns. Center Secur. Emerg. Technol., Georgetown Univ., Washington, DC, USA, Tech. Rep (2021)

[78]

Blake Brittain. 2023. Pulitzer-winning authors join OpenAI, Microsoft copyright lawsuit. https://www.reuters.com/legal/pulitzer-winning-authors-join-openai-microsoft-copyright-lawsuit-2023-12-20/

[79]

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. CoRR abs/1606.01540 (2016). arXiv:1606.01540 http://arxiv.org/abs/1606.01540

[80]

Clarence Ng David Schnurr Eric Luhman Joe Taylor Li Jing Natalie Summers Ricky Wang Rohan Sahai Ryan O’Rourke Troy Luhman Will DePue Yufei Guo Connor Holmes Bill Peebles Tim Brooks. 2024. Creating video from text. (2024). https://doi.org/10.48550/arXiv.2402.17177

[81]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901

